Related papers: Data Repetition Beats Data Scaling in Long-CoT Supervised Fine-Tuning

URL: http://arxiv.org/abs/2602.11149v1
Date: Wed, 11 Feb 2026 18:58:54 GMT
Title: Data Repetition Beats Data Scaling in Long-CoT Supervised Fine-Tuning
Authors: Dawid J. Kopiczko, Sagar Vaze, Tijmen Blankevoort, Yuki M. Asano
Abstract summary: Olmo3-7B trained for 128 epochs on 400 samples outperforms the equivalent 1 epoch on 51200 samples by 12-26 percentage points. We find that training token accuracy reliably signals when repetition has saturated.
Score: 43.11305591635628
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Supervised fine-tuning (SFT) on chain-of-thought data is an essential post-training step for reasoning language models. Standard machine learning intuition suggests that training with more unique training samples yields better generalization. Counterintuitively, we show that SFT benefits from repetition: under a fixed update budget, training for more epochs on smaller datasets outperforms single-epoch training on larger datasets. On AIME'24/25 and GPQA benchmarks, Olmo3-7B trained for 128 epochs on 400 samples outperforms the equivalent 1 epoch on 51200 samples by 12-26 percentage points, with no additional catastrophic forgetting. We find that training token accuracy reliably signals when repetition has saturated; improvements from additional epochs plateau at full memorization, a pattern consistent across all settings. These findings provide a practical approach for reasoning SFT, where scaling epochs with token accuracy as a stopping criterion can replace expensive undirected data scaling. We pose the repetition advantage, where full memorization coincides with improved generalization, as a new open problem for the community in understanding the training dynamics of large language models.
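
The fixed-update-budget comparison and the token-accuracy stopping signal described in the abstract can be illustrated with a minimal sketch. This is not the authors' code. It assumes a fixed batch size, so that equal numbers of samples seen imply equal numbers of updates. The function names and the 0.99 saturation threshold are hypothetical choices made for illustration; the abstract only states that gains plateau at full memorization.

```python
from typing import Optional, Sequence


def epochs_for_budget(samples_seen: int, dataset_size: int) -> int:
    """Number of epochs that consumes a fixed budget of samples seen.

    Assumes a fixed batch size, so equal samples seen means equal updates.
    """
    if dataset_size <= 0 or samples_seen % dataset_size != 0:
        raise ValueError("budget must be a positive multiple of dataset size")
    return samples_seen // dataset_size


def token_accuracy(predicted: Sequence[int], target: Sequence[int]) -> float:
    """Fraction of positions where the predicted token id equals the target."""
    if len(predicted) != len(target) or not target:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(p == t for p, t in zip(predicted, target))
    return matches / len(target)


def saturation_epoch(
    accuracies: Sequence[float], threshold: float = 0.99
) -> Optional[int]:
    """First 1-indexed epoch whose training token accuracy reaches threshold.

    The threshold is an illustrative assumption, not a value from the paper.
    Returns None if the threshold is never reached.
    """
    for epoch, accuracy in enumerate(accuracies, start=1):
        if accuracy >= threshold:
            return epoch
    return None


if __name__ == "__main__":
    budget = 51200  # total samples seen in both settings from the abstract
    print(epochs_for_budget(budget, 400))    # small dataset, many epochs
    print(epochs_for_budget(budget, 51200))  # large dataset, single epoch
```

Under this reading, the two settings in the abstract (128 epochs on 400 samples and 1 epoch on 51200 samples) see the same number of training samples, and `saturation_epoch` marks the point past which further repetition would not be expected to help.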

Related papers

What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy. By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z)
SwiftLearn: A Data-Efficient Training Method of Deep Learning Models using Importance Sampling [3.8330834108666667]
We present SwiftLearn, a data-efficient approach to accelerate training of deep learning models. This subset is selected based on an importance criterion measured over the entire dataset during warm-up stages. We show that almost 90% of the data can be dropped, achieving an end-to-end average speedup of 3.36x while keeping the average accuracy drop below 0.92%.
arXiv Detail & Related papers (2023-11-25T22:51:01Z)
D4: Improving LLM Pretraining via Document De-Duplication and Diversification [38.84592304799403]
We show that careful data selection via pre-trained model embeddings can speed up training. We also show that repeating data intelligently consistently outperforms baseline training.
arXiv Detail & Related papers (2023-08-23T17:58:14Z)
NLU on Data Diets: Dynamic Data Subset Selection for NLP Classification Tasks [0.0]
Finetuning large language models inflates the costs of NLU applications. Recent works in computer vision use data pruning to reduce training time. We propose a curriculum which periodically scores and discards unimportant examples during finetuning.
arXiv Detail & Related papers (2023-06-05T19:30:41Z)
Scaling Data-Constrained Language Models [133.2083255645999]
We investigate scaling language models in data-constrained regimes. We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data. We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters.
arXiv Detail & Related papers (2023-05-25T17:18:55Z)
Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models [107.05966685291067]
We propose test-time prompt tuning (TPT) to learn adaptive prompts on the fly with a single test sample. TPT improves the zero-shot top-1 accuracy of CLIP by 3.6% on average. In evaluating cross-dataset generalization with unseen categories, TPT performs on par with the state-of-the-art approaches that use additional training data.
arXiv Detail & Related papers (2022-09-15T17:55:11Z)
Dataset Pruning: Reducing Training Data by Examining Generalization Influence [30.30255670341501]
Do all training data contribute to the model's performance? How can we construct the smallest subset of the entire training data as a proxy training set without significantly sacrificing the model's performance?
arXiv Detail & Related papers (2022-05-19T05:36:35Z)
Predicting Training Time Without Training [120.92623395389255]
We tackle the problem of predicting the number of optimization steps that a pre-trained deep network needs to converge to a given value of the loss function. We leverage the fact that the training dynamics of a deep network during fine-tuning are well approximated by those of a linearized model. We are able to predict the time it takes to fine-tune a model to a given loss without having to perform any training.
arXiv Detail & Related papers (2020-08-28T04:29:54Z)
Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training. We experimentally verify that the new dataset can significantly improve the ability of the learned FER model. To tackle this, we propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)

This list is automatically generated from the titles and abstracts of the papers on this site.