Midtraining Bridges Pretraining and Posttraining Distributions
- URL: http://arxiv.org/abs/2510.14865v1
- Date: Thu, 16 Oct 2025 16:39:52 GMT
- Title: Midtraining Bridges Pretraining and Posttraining Distributions
- Authors: Emmy Liu, Graham Neubig, Chenyan Xiong
- Abstract summary: "Midtraining" is a phase in which higher-quality, often instruction-formatted data is mixed in at the end of pretraining. We conduct the first systematic investigation of midtraining through experiments with language models pretrained from scratch. We find that, when compared after supervised fine-tuning, the effectiveness of midtraining is highest in the math and code domains.
- Score: 73.84346031272473
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, many language models have been pretrained with a "midtraining" phase, in which higher-quality, often instruction-formatted data is mixed in at the end of pretraining. Despite the popularity of this practice, there is little scientific understanding of this phase of model training or why it is effective. In this work, we conduct the first systematic investigation of midtraining through controlled experiments with language models pretrained from scratch and fine-tuned on supervised fine-tuning datasets in different domains. We find that, when compared after supervised fine-tuning, the effectiveness of midtraining is highest in the math and code domains, where midtraining can best reduce the syntactic gap between pretraining and posttraining data. In these cases, midtraining consistently outperforms continued pretraining in both in-domain validation loss and pretraining data forgetting after posttraining. We conduct ablations on the starting time of the midtraining phase and the mixture weights of the midtraining data, using code midtraining as a case study, and find that timing has a greater impact than mixture weights: earlier introduction of specialized data yields greater in-domain benefits while better preserving general language modeling. These findings establish midtraining as a domain adaptation technique that, compared to continued pretraining, yields better performance through reduced forgetting.
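To make the described setup concrete, below is a minimal sketch of a midtraining-style data mixture schedule under a generic next-token-prediction training loop. The dataset variables, the 80% start point (`midtrain_start_frac`), and the 0.3 mixing probability (`midtrain_weight`) are illustrative assumptions, not values reported in the paper.

```python
# Minimal sketch: mix higher-quality (e.g. instruction-formatted) data into the
# sampling stream once training passes a chosen point. Names and default values
# are hypothetical placeholders for illustration only.
import random

def sample_example(pretrain_data, midtrain_data, step, total_steps,
                   midtrain_start_frac=0.8, midtrain_weight=0.3):
    """Draw one training example, switching to a mixed distribution
    after the midtraining phase begins."""
    in_midtraining = step >= midtrain_start_frac * total_steps
    if in_midtraining and random.random() < midtrain_weight:
        return random.choice(midtrain_data)   # higher-quality / instruction-formatted data
    return random.choice(pretrain_data)       # ordinary web-scale pretraining data

# Example usage with toy corpora:
pretrain_corpus = ["web text A", "web text B", "web text C"]
midtrain_corpus = ["instruction example X", "math/code example Y"]
for step in range(10):
    _ = sample_example(pretrain_corpus, midtrain_corpus, step, total_steps=10)
```

The paper's ablations correspond to varying when the mixture switches on (here `midtrain_start_frac`) and how heavily the specialized data is weighted (here `midtrain_weight`), with the reported finding that an earlier start matters more than the exact weight.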
Related papers
- ReMiT: RL-Guided Mid-Training for Iterative LLM Evolution [49.496216822640974]
We analyze training dynamics and identify the mid-training phase as a critical turning point for model capabilities. We introduce ReMiT (Reinforcement Learning-Guided Mid-Training), which prioritizes tokens during the mid-training phase, focusing on those pivotal for reasoning.
arXiv Detail & Related papers (2026-02-03T04:04:41Z) - Front-Loading Reasoning: The Synergy between Pretraining and Post-Training Data [68.85234898614571]
The prevailing paradigm for enhancing the reasoning abilities of LLMs revolves around post-training on high-quality, reasoning-intensive data. While emerging literature suggests that reasoning data is increasingly incorporated during the mid-training stage as well, its role in pretraining remains unclear. We conduct the first systematic study of how reasoning data, varying in scale, diversity, and quality, affects LLM performance when introduced at different stages of training.
arXiv Detail & Related papers (2025-09-26T20:08:51Z) - RLP: Reinforcement as a Pretraining Objective [103.45068938532923]
We present an information-driven reinforcement pretraining objective that brings the core spirit of reinforcement learning, exploration, to the last phase of pretraining. This objective encourages the model to think for itself before predicting what comes next, teaching independent thinking behavior earlier in pretraining. Specifically, RLP reframes reinforcement learning for reasoning as a pretraining objective on ordinary text, bridging the gap between next-token prediction and the emergence of useful chain-of-thought reasoning.
arXiv Detail & Related papers (2025-09-26T17:53:54Z) - A Comparative Study of Pre-training and Self-training [0.40964539027092917]
We propose an ensemble method to empirically study all feasible training paradigms combining pre-training, self-training, and fine-tuning.
We conduct experiments on six datasets, four data augmentation methods, and imbalanced data settings for sentiment analysis and natural language inference tasks.
Our findings confirm that the pre-training and fine-tuning paradigm yields the best overall performances.
arXiv Detail & Related papers (2024-09-04T14:30:13Z) - Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks [91.15120211190519]
This paper aims to understand the nature of noise in pre-training datasets and to mitigate its impact on downstream tasks.
We propose a lightweight black-box tuning method (NMTune) that applies an affine transformation to the feature space to mitigate the malignant effect of noise.
arXiv Detail & Related papers (2023-09-29T06:18:15Z) - Examining the Effect of Pre-training on Time Series Classification [21.38211396933795]
This study investigates how pre-training affects the subsequent fine-tuning process.
We conducted a thorough examination of 150 classification datasets.
We find that pre-training can only help improve the optimization process for models that fit the data poorly.
Adding more pre-training data does not improve generalization, but it can strengthen the advantage of pre-training on the original data volume.
arXiv Detail & Related papers (2023-09-11T06:26:57Z) - Downstream Datasets Make Surprisingly Good Pretraining Corpora [39.77171117174906]
This paper introduces a large-scale study of self-pretraining, where the same (downstream) training data is used for both pretraining and finetuning.
In experiments addressing both ELECTRA and RoBERTa models and 10 distinct downstream classification datasets, we observe that self-pretraining rivals standard pretraining on the BookWiki corpus.
Our results hint that performance gains attributable to pretraining are driven primarily by the pretraining objective itself and are not always attributable to the use of external pretraining data in massive amounts.
arXiv Detail & Related papers (2022-09-28T19:28:43Z) - Continual Pre-Training Mitigates Forgetting in Language and Vision [43.80547864450793]
We show that continually pre-trained models are robust against catastrophic forgetting.
We provide empirical evidence that self-supervised pre-training is more effective in retaining previous knowledge than supervised protocols.
arXiv Detail & Related papers (2022-05-19T07:27:12Z) - Don't Stop Pretraining: Adapt Language Models to Domains and Tasks [81.99843216550306]
We present a study across four domains (biomedical and computer science publications, news, and reviews) and eight classification tasks.
A second phase of pretraining in-domain (domain-adaptive pretraining) leads to performance gains.
Adapting to the task's unlabeled data (task-adaptive pretraining) improves performance even after domain-adaptive pretraining.
arXiv Detail & Related papers (2020-04-23T04:21:19Z)