A Survey on LLM Mid-Training
- URL: http://arxiv.org/abs/2510.23081v2
- Date: Tue, 04 Nov 2025 11:00:12 GMT
- Title: A Survey on LLM Mid-Training
- Authors: Chengying Tu, Xuemiao Zhang, Rongxiang Weng, Rumei Li, Chen Zhang, Yang Bai, Hongfei Yan, Jingang Wang, Xunliang Cai
- Abstract summary: Mid-training is a vital stage that bridges pre-training and post-training. This survey provides a formal definition of mid-training for large language models (LLMs).
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in foundation models have highlighted the significant benefits of multi-stage training, with a particular emphasis on the emergence of mid-training as a vital stage that bridges pre-training and post-training. Mid-training is distinguished by its use of intermediate data and computational resources, systematically enhancing specified capabilities such as mathematics, coding, reasoning, and long-context extension, while maintaining foundational competencies. This survey provides a formal definition of mid-training for large language models (LLMs) and investigates optimization frameworks that encompass data curation, training strategies, and model architecture optimization. We analyze mainstream model implementations in the context of objective-driven interventions, illustrating how mid-training serves as a distinct and critical stage in the progressive development of LLM capabilities. By clarifying the unique contributions of mid-training, this survey offers a comprehensive taxonomy and actionable insights, supporting future research and innovation in the advancement of LLMs.
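To make the abstract's framing concrete, here is a minimal, self-contained sketch of the two levers most mid-training recipes adjust: re-weighting the data mixture toward targeted capabilities (math, code, long context) and annealing the learning rate down from the pre-training plateau. The domain names, mixture weights, and learning-rate values are illustrative assumptions, not taken from any surveyed model.

```python
import math
import random

# Hypothetical domain mixtures: a broad pre-training mix vs. a mid-training
# mix re-weighted toward math, code, and long-context data. Weights are
# illustrative, not a published recipe.
PRETRAIN_MIX = {"web": 0.70, "code": 0.15, "math": 0.05, "long_context": 0.10}
MIDTRAIN_MIX = {"web": 0.30, "code": 0.30, "math": 0.20, "long_context": 0.20}

def sample_domain(mix):
    """Draw the domain of the next training document from the mixture."""
    domains, weights = zip(*mix.items())
    return random.choices(domains, weights=weights, k=1)[0]

def midtrain_lr(step, total_steps, peak_lr=3e-4, final_lr=3e-5):
    """Cosine-anneal the learning rate from the pre-training plateau."""
    progress = min(step / max(total_steps, 1), 1.0)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))

if __name__ == "__main__":
    total = 1000
    for step in (0, 500, 1000):
        print(f"step {step:4d}: lr={midtrain_lr(step, total):.2e}, "
              f"next domain={sample_domain(MIDTRAIN_MIX)}")
```

Cosine decay is only one common choice for the anneal phase; linear or inverse-square-root schedules fit the same interface.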
Related papers
- Mid-Training of Large Language Models: A Survey
Large language models (LLMs) are typically developed through large-scale pre-training followed by task-specific fine-tuning. Recent advances highlight the importance of an intermediate mid-training stage. We introduce the first taxonomy of mid-training spanning data distribution, learning-rate scheduling, and long-context extension (a sketch of one long-context extension technique appears after this list).
arXiv Detail & Related papers (2025-10-08T09:49:37Z)
- EvoLM: In Search of Lost Language Model Training Dynamics
EvoLM is a model suite that enables systematic and transparent analysis of LMs' training dynamics across pre-training, continued pre-training, supervised fine-tuning, and reinforcement learning. By training over 100 LMs with 1B and 4B parameters from scratch, we rigorously evaluate both upstream (language modeling) and downstream (problem-solving) reasoning capabilities.
arXiv Detail & Related papers (2025-06-19T04:58:47Z)
- LLM Post-Training: A Deep Dive into Reasoning Large Language Models
Large Language Models (LLMs) have transformed the natural language processing landscape and brought to life diverse applications. Post-training methods enable LLMs to refine their knowledge, improve reasoning, enhance factual accuracy, and align more effectively with user intents and ethical considerations.
arXiv Detail & Related papers (2025-02-28T18:59:54Z)
- Hephaestus: Improving Fundamental Agent Capabilities of Large Language Models through Continual Pre-Training
Hephaestus-Forge is a large-scale pre-training corpus designed to enhance the capabilities of LLM agents in API function calling, intrinsic reasoning, and planning. Hephaestus-Forge comprises 103B tokens of agent-specific data encompassing 76,537 APIs, including both tool documentation to introduce knowledge of API functions and function-calling trajectories. By continual pre-training on Hephaestus-Forge, Hephaestus outperforms small- to medium-scale open-source LLMs and rivals commercial LLMs on three agent benchmarks.
arXiv Detail & Related papers (2025-02-10T15:54:34Z)
- Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate
We present the Modality Integration Rate (MIR), an effective, robust, and generalized metric that indicates the multi-modal pre-training quality of Large Vision-Language Models (LVLMs).
MIR can guide training data selection, training strategy scheduling, and model architecture design toward better pre-training results.
arXiv Detail & Related papers (2024-10-09T17:59:04Z)
- Recent Advances in Federated Learning Driven Large Language Models: A Survey on Architecture, Performance, and Security
Federated Learning (FL) offers a promising paradigm for training Large Language Models (LLMs) in a decentralized manner while preserving data privacy and minimizing communication overhead. We review a range of strategies enabling unlearning in federated LLMs, including perturbation-based methods, model decomposition, and incremental retraining. This survey identifies critical research directions toward developing secure, adaptable, and high-performing federated LLM systems for real-world deployment.
arXiv Detail & Related papers (2024-06-14T08:40:58Z)
- Understanding LLMs: A Comprehensive Overview from Training to Inference
The paper positions low-cost training and deployment of large language models as the field's future development trend.
Its discussion of training covers data preprocessing, training architectures, pre-training tasks, parallel training, and model fine-tuning.
On the inference side, it covers model compression, parallel computation, memory scheduling, and structural optimization (a minimal model-compression sketch appears after this list).
arXiv Detail & Related papers (2024-01-04T02:43:57Z)
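The first related paper above names long-context extension as one axis of its taxonomy; as referenced there, here is a minimal sketch of one widely used technique: scaling the RoPE base frequency ("theta") so that positional wavelengths stretch to cover a longer context window. The head dimension and base values below are illustrative assumptions, not parameters from any surveyed model.

```python
import math

def rope_inv_freq(head_dim, base):
    """Inverse rotation frequencies for rotary position embeddings (RoPE)."""
    return [1.0 / (base ** (2 * i / head_dim)) for i in range(head_dim // 2)]

def max_wavelength(head_dim, base):
    """Longest rotation period across dimension pairs; positions far beyond
    this are hard to distinguish for the slowest-rotating pair."""
    return 2 * math.pi / min(rope_inv_freq(head_dim, base))

if __name__ == "__main__":
    # Raising the base stretches every wavelength, which is why mid-training
    # for long context often increases it before training on long documents.
    for base in (10_000.0, 500_000.0):
        print(f"base={base:>9,.0f}: max wavelength ~ "
              f"{max_wavelength(128, base):,.0f} positions")
```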
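The last entry above lists model compression among its inference-side topics; as referenced there, this is a minimal sketch of one compression idea, symmetric per-tensor int8 weight quantization, written in pure Python for clarity. Real systems quantize framework tensors, typically with per-channel scales.

```python
def quantize_int8(weights):
    """Map float weights to int8 values with a single symmetric scale."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # guard all-zero input
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [x * scale for x in q]

if __name__ == "__main__":
    w = [0.12, -0.5, 0.031, 0.25]
    q, s = quantize_int8(w)
    print("int8 values:", q, "scale:", round(s, 6))
    print("reconstructed:", [round(x, 4) for x in dequantize_int8(q, s)])
```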
This list is automatically generated from the titles and abstracts of the papers on this site.