Beyond Multi-Token Prediction: Pretraining LLMs with Future Summaries
- URL: http://arxiv.org/abs/2510.14751v1
- Date: Thu, 16 Oct 2025 14:52:52 GMT
- Title: Beyond Multi-Token Prediction: Pretraining LLMs with Future Summaries
- Authors: Divyat Mahajan, Sachin Goyal, Badr Youbi Idrissi, Mohammad Pezeshki, Ioannis Mitliagkas, David Lopez-Paz, Kartik Ahuja
- Abstract summary: Future summary prediction (FSP) trains an auxiliary head to predict a compact representation of the long-term future. FSP provides improvements over both NTP and MTP across math, reasoning, and coding benchmarks.
- Score: 35.39150917025755
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Next-token prediction (NTP) has driven the success of large language models (LLMs), but it struggles with long-horizon reasoning, planning, and creative writing, limitations largely attributed to teacher-forced training. Multi-token prediction (MTP) partially mitigates these issues by predicting several future tokens at once, but it mostly captures short-range dependencies and offers limited improvement. We propose future summary prediction (FSP), which trains an auxiliary head to predict a compact representation of the long-term future, preserving information relevant for long-form generations. We explore two variants of FSP: handcrafted summaries, for example a bag-of-words summary of the sequence's future, and learned summaries, which use embeddings produced by a reverse language model trained from right to left. Large-scale pretraining experiments (3B- and 8B-parameter models) demonstrate that FSP provides improvements over both NTP and MTP across math, reasoning, and coding benchmarks.
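The bag-of-words variant is concrete enough to sketch. Below is a minimal PyTorch sketch in which an auxiliary linear head on the trunk's hidden states predicts a multi-hot bag of the next `window` tokens; the window size, the loss, and all names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def fsp_bow_loss(hidden, input_ids, vocab_size, bow_head, window=64):
    """Auxiliary future-summary loss, bag-of-words variant (illustrative).

    hidden:    (B, T, D) trunk hidden states
    input_ids: (B, T)    token ids of the same sequence
    bow_head:  nn.Linear(D, vocab_size) auxiliary head
    """
    B, T = input_ids.shape
    logits = bow_head(hidden)  # (B, T, V): per-position future summary
    loss, count = 0.0, 0
    for t in range(T - 1):
        future = input_ids[:, t + 1 : t + 1 + window]            # next tokens
        target = torch.zeros(B, vocab_size, device=hidden.device)
        target.scatter_(1, future, 1.0)                          # multi-hot bag of words
        loss = loss + F.binary_cross_entropy_with_logits(logits[:, t], target)
        count += 1
    return loss / max(count, 1)
```

During pretraining this auxiliary loss would be added to the standard next-token cross-entropy with some weight; the learned-summary variant would instead regress the head's output onto embeddings produced by the right-to-left reverse language model.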
Related papers
- Reinforced Fast Weights with Next-Sequence Prediction [42.710296902935426]
REFINE is a reinforcement learning framework that trains fast weight models under the next-sequence prediction (NSP) objective.
REFINE consistently outperforms supervised fine-tuning with NTP across needle-in-a-haystack retrieval, long-context question answering, and diverse tasks in LongBench.
arXiv Detail & Related papers (2026-02-18T18:53:18Z)
- Next Concept Prediction in Discrete Latent Space Leads to Stronger Language Models [62.054835560934066]
Next Concept Prediction is a generative pretraining paradigm built on top of Next Token Prediction.
Our model, ConceptLM, quantizes hidden states using Vector Quantization and constructs a concept vocabulary.
Results on 13 benchmarks show that NCP yields consistent performance gains over traditional token-level models.
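The quantization step can be sketched with standard VQ machinery: look up the nearest codebook entry (the "concept vocabulary") and pass gradients through with a straight-through estimator. The details below follow common VQ-VAE practice and are assumptions, not specifics from the paper.

```python
import torch

def quantize(hidden, codebook):
    """Map hidden states to their nearest codebook entries (illustrative).

    hidden:   (B, T, D) trunk hidden states
    codebook: (K, D)    learned concept embeddings
    """
    flat = hidden.reshape(-1, hidden.size(-1))      # (B*T, D)
    dists = torch.cdist(flat, codebook)             # (B*T, K) pairwise distances
    concept_ids = dists.argmin(dim=-1)              # nearest concept per position
    quantized = codebook[concept_ids].view_as(hidden)
    # straight-through estimator: forward uses the code, gradient flows to hidden
    quantized = hidden + (quantized - hidden).detach()
    return quantized, concept_ids.view(hidden.shape[:2])
```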
arXiv Detail & Related papers (2026-02-09T18:33:31Z)
- Context-level Language Modeling by Learning Predictive Context Embeddings [79.00607069677393]
We introduce ContextLM, a framework that augments standard pretraining with an inherent next-context prediction objective.
This mechanism trains the model to learn predictive representations of multi-token contexts, leveraging error signals derived from future token chunks.
Experiments on the GPT2 and Pythia model families, scaled up to 1.5B parameters, show that ContextLM delivers consistent improvements in both perplexity and downstream task performance.
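One plausible reading of the objective, sketched below: each position is trained to predict a pooled representation of the next chunk of tokens. The chunk length, mean pooling, and cosine loss here are all assumptions for illustration, not the paper's construction.

```python
import torch.nn.functional as F

def next_context_loss(hidden, chunk_len=8):
    """Train position t to predict a pooled embedding of tokens
    t+1 .. t+chunk_len (illustrative). hidden: (B, T, D)."""
    B, T, D = hidden.shape
    loss, count = 0.0, 0
    for t in range(T - chunk_len):
        target = hidden[:, t + 1 : t + 1 + chunk_len].mean(dim=1).detach()
        pred = hidden[:, t]
        loss = loss + (1 - F.cosine_similarity(pred, target, dim=-1)).mean()
        count += 1
    return loss / max(count, 1)
```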
arXiv Detail & Related papers (2025-10-23T07:09:45Z)
- Your LLM Knows the Future: Uncovering Its Multi-Token Prediction Potential [12.719829360337833]
We propose a novel framework that leverages the inherent knowledge of vanilla autoregressive language models about future tokens.
Our method achieves significant speedups through supervised fine-tuning on pretrained models.
arXiv Detail & Related papers (2025-07-16T02:31:40Z)
- Reinforcement Pre-Training [78.5355979575498]
We introduce Reinforcement Pre-Training (RPT) as a new scaling paradigm for large language models and reinforcement learning (RL).
RPT offers a scalable method to leverage vast amounts of text data for general-purpose RL, rather than relying on domain-specific annotated answers.
The results position RPT as an effective and promising scaling paradigm to advance language model pre-training.
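The key move, heavily simplified in the sketch below, is turning raw text into a verifiable RL reward: the model is rewarded when its next-token prediction matches the corpus, so no human annotation is needed. The exact-match reward here is an assumption; the paper's actual reward may be more nuanced.

```python
def rpt_reward(predicted_token: str, corpus_token: str) -> float:
    """Verifiable reward from unlabeled text (illustrative): 1.0 if the
    model's next-token prediction matches the corpus continuation."""
    return 1.0 if predicted_token == corpus_token else 0.0
```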
arXiv Detail & Related papers (2025-06-09T17:59:53Z)
- Pre-Training Curriculum for Multi-Token Prediction in Language Models [2.8071268036220003]
Multi-token prediction (MTP) is a recently proposed pre-training objective for language models.
We propose a curriculum learning strategy for MTP training, exploring two variants: a forward curriculum and a reverse curriculum.
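The two curricula could be scheduled as in the sketch below: a forward curriculum grows the number of active future-token heads over training, starting near plain NTP, while a reverse curriculum shrinks it. The linear schedule and head counts are assumptions for illustration.

```python
def mtp_heads_at_step(step: int, total_steps: int, max_heads: int = 4,
                      reverse: bool = False) -> int:
    """Number of future-token heads active at a given step (illustrative).

    Forward curriculum: 1 -> max_heads over training.
    Reverse curriculum: max_heads -> 1 over training.
    """
    frac = step / max(total_steps, 1)
    if reverse:
        frac = 1.0 - frac
    return 1 + round(frac * (max_heads - 1))
```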
arXiv Detail & Related papers (2025-05-28T18:19:18Z)
- L-MTP: Leap Multi-Token Prediction Beyond Adjacent Context for Large Language Models [95.53699156138435]
We propose leap multi-token prediction (L-MTP), an innovative token prediction method.
Unlike conventional MTP, L-MTP strategically skips over intermediate tokens, predicting non-sequential ones in a single forward pass.
We theoretically demonstrate the benefit of L-MTP in improving inference efficiency.
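The contrast with adjacent-offset MTP is easiest to see in the future offsets each head targets; the specific leap pattern below is an illustrative assumption.

```python
def target_offsets(n_heads: int, leap: int = 2):
    """Future-token offsets targeted by each prediction head (illustrative).

    Standard MTP heads cover adjacent offsets 1, 2, ..., n; a leap variant
    skips intermediate tokens and covers a longer horizon with the same
    number of heads.
    """
    mtp = [i + 1 for i in range(n_heads)]           # e.g. [1, 2, 3, 4]
    l_mtp = [1 + i * leap for i in range(n_heads)]  # e.g. [1, 3, 5, 7]
    return mtp, l_mtp
```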
arXiv Detail & Related papers (2025-05-23T05:59:46Z)
- Efficient Joint Prediction of Multiple Future Tokens [20.647830092055955]
We introduce joint multi-token prediction (JTP), a lightweight modification of standard next-token prediction.
Unlike previous multi-token prediction approaches, JTP strategically employs teacher forcing of future tokens.
We show that the JTP approach achieves a short-horizon belief state representation, while popular alternatives for multi-token prediction fail to do so.
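One way teacher forcing of future tokens could enter a joint head, as a sketch: the prediction for token t+1+i is conditioned on the ground-truth future tokens before it via their embeddings, rather than on the model's own guesses. The architecture below is an illustrative assumption, not the paper's design.

```python
import torch.nn as nn

class JointFutureHead(nn.Module):
    """Jointly predict k future tokens; later predictions condition on
    ground-truth earlier future tokens (illustrative)."""

    def __init__(self, d_model: int, vocab_size: int, k: int = 4):
        super().__init__()
        self.k = k
        self.embed = nn.Embedding(vocab_size, d_model)
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, h_t, future_ids):
        """h_t: (B, D) trunk state at position t;
        future_ids: (B, k) gold tokens t+1 .. t+k."""
        state, logits = h_t, []
        for i in range(self.k):
            logits.append(self.proj(state))          # predict token t+1+i
            # teacher forcing: feed the gold future token back in
            state = state + self.embed(future_ids[:, i])
        return logits
```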
arXiv Detail & Related papers (2025-03-24T19:52:42Z)
- Adversarial Generative Grammars for Human Activity Prediction [141.43526239537502]
We propose an adversarial generative grammar model for future prediction.
Our grammar is designed so that it can learn production rules from the data distribution.
Being able to select multiple production rules during inference leads to different predicted outcomes.
arXiv Detail & Related papers (2020-08-11T17:47:53Z)
- Ambiguity in Sequential Data: Predicting Uncertain Futures with Recurrent Models [110.82452096672182]
We propose an extension of the Multiple Hypothesis Prediction (MHP) model to handle ambiguous predictions with sequential data.
We also introduce a novel metric for ambiguous problems, which is better suited to accounting for uncertainty.
arXiv Detail & Related papers (2020-03-10T09:15:42Z)