A Survey on LLM Mid-Training
- URL: http://arxiv.org/abs/2510.23081v2
- Date: Tue, 04 Nov 2025 11:00:12 GMT
- Title: A Survey on LLM Mid-Training
- Authors: Chengying Tu, Xuemiao Zhang, Rongxiang Weng, Rumei Li, Chen Zhang, Yang Bai, Hongfei Yan, Jingang Wang, Xunliang Cai
- Abstract summary: Mid-training is a vital stage that bridges pre-training and post-training. This survey provides a formal definition of mid-training for large language models (LLMs).
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in foundation models have highlighted the significant benefits of multi-stage training, with a particular emphasis on the emergence of mid-training as a vital stage that bridges pre-training and post-training. Mid-training is distinguished by its use of intermediate data and computational resources, systematically enhancing specified capabilities such as mathematics, coding, reasoning, and long-context extension, while maintaining foundational competencies. This survey provides a formal definition of mid-training for large language models (LLMs) and investigates optimization frameworks that encompass data curation, training strategies, and model architecture optimization. We analyze mainstream model implementations in the context of objective-driven interventions, illustrating how mid-training serves as a distinct and critical stage in the progressive development of LLM capabilities. By clarifying the unique contributions of mid-training, this survey offers a comprehensive taxonomy and actionable insights, supporting future research and innovation in the advancement of LLMs.
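To make the abstract's framing concrete, here is a minimal, self-contained sketch of the two levers most mid-training recipes adjust: re-weighting the data mixture toward targeted capabilities (math, code, long context) and annealing the learning rate down from the pre-training plateau. The domain names, mixture weights, and learning-rate values are illustrative assumptions, not taken from any surveyed model.

```python
import math
import random

# Hypothetical domain mixtures: a broad pre-training mix vs. a mid-training
# mix re-weighted toward math, code, and long-context data. Weights are
# illustrative, not a published recipe.
PRETRAIN_MIX = {"web": 0.70, "code": 0.15, "math": 0.05, "long_context": 0.10}
MIDTRAIN_MIX = {"web": 0.30, "code": 0.30, "math": 0.20, "long_context": 0.20}

def sample_domain(mix):
    """Draw the domain of the next training document from the mixture."""
    domains, weights = zip(*mix.items())
    return random.choices(domains, weights=weights, k=1)[0]

def midtrain_lr(step, total_steps, peak_lr=3e-4, final_lr=3e-5):
    """Cosine-anneal the learning rate from the pre-training plateau."""
    progress = min(step / max(total_steps, 1), 1.0)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))

if __name__ == "__main__":
    total = 1000
    for step in (0, 500, 1000):
        print(f"step {step:4d}: lr={midtrain_lr(step, total):.2e}, "
              f"next domain={sample_domain(MIDTRAIN_MIX)}")
```

Cosine decay is only one common choice for the anneal phase; linear or inverse-square-root schedules fit the same interface.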
Related papers
- Mid-Training of Large Language Models: A Survey
Large language models (LLMs) are typically developed through large-scale pre-training followed by task-specific fine-tuning. Recent advances highlight the importance of an intermediate mid-training stage. We introduce the first taxonomy of mid-training spanning data distribution, learning-rate scheduling, and long-context extension (a sketch of one long-context extension technique appears after this list).
arXiv Detail & Related papers (2025-10-08T09:49:37Z)
- EvoLM: In Search of Lost Language Model Training Dynamics
EvoLM is a model suite that enables systematic and transparent analysis of LMs' training dynamics across pre-training, continued pre-training, supervised fine-tuning, and reinforcement learning. By training over 100 LMs with 1B and 4B parameters from scratch, we rigorously evaluate both upstream (language modeling) and downstream (problem-solving) reasoning capabilities.
arXiv Detail & Related papers (2025-06-19T04:58:47Z)
- LLM Post-Training: A Deep Dive into Reasoning Large Language Models
Large Language Models (LLMs) have transformed the natural language processing landscape and brought to life diverse applications. Post-training methods enable LLMs to refine their knowledge, improve reasoning, enhance factual accuracy, and align more effectively with user intents and ethical considerations.
arXiv Detail & Related papers (2025-02-28T18:59:54Z)
- Hephaestus: Improving Fundamental Agent Capabilities of Large Language Models through Continual Pre-Training
Hephaestus-Forge is a large-scale pre-training corpus designed to enhance the capabilities of LLM agents in API function calling, intrinsic reasoning, and planning. Hephaestus-Forge comprises 103B tokens of agent-specific data encompassing 76,537 APIs, including both tool documentation to introduce knowledge of API functions and function-calling trajectories. By continual pre-training on Hephaestus-Forge, Hephaestus outperforms small- to medium-scale open-source LLMs and rivals commercial LLMs on three agent benchmarks.
arXiv Detail & Related papers (2025-02-10T15:54:34Z)
- Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate
We present the Modality Integration Rate (MIR), an effective, robust, and generalized metric that indicates the multi-modal pre-training quality of Large Vision-Language Models (LVLMs).
MIR can guide training data selection, training strategy scheduling, and model architecture design toward better pre-training results.
arXiv Detail & Related papers (2024-10-09T17:59:04Z)
- Recent Advances in Federated Learning Driven Large Language Models: A Survey on Architecture, Performance, and Security
Federated Learning (FL) offers a promising paradigm for training Large Language Models (LLMs) in a decentralized manner while preserving data privacy and minimizing communication overhead. We review a range of strategies enabling unlearning in federated LLMs, including perturbation-based methods, model decomposition, and incremental retraining. This survey identifies critical research directions toward developing secure, adaptable, and high-performing federated LLM systems for real-world deployment.
arXiv Detail & Related papers (2024-06-14T08:40:58Z)
- Understanding LLMs: A Comprehensive Overview from Training to Inference
The paper positions low-cost training and deployment of large language models as the field's future development trend.
Its discussion of training covers data preprocessing, training architectures, pre-training tasks, parallel training, and model fine-tuning.
On the inference side, it covers model compression, parallel computation, memory scheduling, and structural optimization (a minimal model-compression sketch appears after this list).
arXiv Detail & Related papers (2024-01-04T02:43:57Z)
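The first related paper above names long-context extension as one axis of its taxonomy; as referenced there, here is a minimal sketch of one widely used technique: scaling the RoPE base frequency ("theta") so that positional wavelengths stretch to cover a longer context window. The head dimension and base values below are illustrative assumptions, not parameters from any surveyed model.

```python
import math

def rope_inv_freq(head_dim, base):
    """Inverse rotation frequencies for rotary position embeddings (RoPE)."""
    return [1.0 / (base ** (2 * i / head_dim)) for i in range(head_dim // 2)]

def max_wavelength(head_dim, base):
    """Longest rotation period across dimension pairs; positions far beyond
    this are hard to distinguish for the slowest-rotating pair."""
    return 2 * math.pi / min(rope_inv_freq(head_dim, base))

if __name__ == "__main__":
    # Raising the base stretches every wavelength, which is why mid-training
    # for long context often increases it before training on long documents.
    for base in (10_000.0, 500_000.0):
        print(f"base={base:>9,.0f}: max wavelength ~ "
              f"{max_wavelength(128, base):,.0f} positions")
```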
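The last entry above lists model compression among its inference-side topics; as referenced there, this is a minimal sketch of one compression idea, symmetric per-tensor int8 weight quantization, written in pure Python for clarity. Real systems quantize framework tensors, typically with per-channel scales.

```python
def quantize_int8(weights):
    """Map float weights to int8 values with a single symmetric scale."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # guard all-zero input
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [x * scale for x in q]

if __name__ == "__main__":
    w = [0.12, -0.5, 0.031, 0.25]
    q, s = quantize_int8(w)
    print("int8 values:", q, "scale:", round(s, 6))
    print("reconstructed:", [round(x, 4) for x in dequantize_int8(q, s)])
```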
This list is automatically generated from the titles and abstracts of the papers on this site.