Pre-Training Curriculum for Multi-Token Prediction in Language Models
- URL: http://arxiv.org/abs/2505.22757v1
- Date: Wed, 28 May 2025 18:19:18 GMT
- Title: Pre-Training Curriculum for Multi-Token Prediction in Language Models
- Authors: Ansar Aynetdinov, Alan Akbik
- Abstract summary: Multi-token prediction (MTP) is a recently proposed pre-training objective for language models. We propose a curriculum learning strategy for MTP training, exploring two variants: a forward curriculum and a reverse curriculum.
- Score: 2.8071268036220003
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-token prediction (MTP) is a recently proposed pre-training objective for language models. Rather than predicting only the next token (NTP), MTP predicts the next $k$ tokens at each prediction step, using multiple prediction heads. MTP has shown promise in improving downstream performance, inference speed, and training efficiency, particularly for large models. However, prior work has shown that smaller language models (SLMs) struggle with the MTP objective. To address this, we propose a curriculum learning strategy for MTP training, exploring two variants: a forward curriculum, which gradually increases the complexity of the pre-training objective from NTP to MTP, and a reverse curriculum, which does the opposite. Our experiments show that the forward curriculum enables SLMs to better leverage the MTP objective during pre-training, improving downstream NTP performance and generative output quality, while retaining the benefits of self-speculative decoding. The reverse curriculum achieves stronger NTP performance and output quality, but fails to provide any self-speculative decoding benefits.
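The forward and reverse curricula described in the abstract can be pictured as a schedule over how many prediction heads contribute to the loss at a given training step. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names `active_heads` and `mtp_loss`, the equal-length stages, and the averaging of per-head losses are all hypothetical choices made for illustration.

```python
def active_heads(step: int, total_steps: int, k_max: int, reverse: bool = False) -> int:
    """Return how many prediction heads contribute to the loss at `step`.

    Assumption: training is split into k_max equal-length stages.
    Forward curriculum: 1 head (plain NTP) at first, k_max heads (full MTP) at the end.
    Reverse curriculum: k_max heads at first, 1 head at the end.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must be in [0, total_steps)")
    # Stage index runs from 0 to k_max - 1 as training progresses.
    stage = step * k_max // total_steps
    return k_max - stage if reverse else stage + 1


def mtp_loss(per_head_losses: list[float], k_active: int) -> float:
    """Average the losses of the first k_active heads (index 0 is the NTP head).

    Assumption: averaging is an illustrative choice; summing is equally plausible.
    """
    return sum(per_head_losses[:k_active]) / k_active


if __name__ == "__main__":
    # Print the number of active heads at a few points in a 100-step run with k_max = 4.
    for step in (0, 25, 50, 99):
        print(step, active_heads(step, 100, 4), active_heads(step, 100, 4, reverse=True))
```

Under this schedule the forward variant ends training on the full MTP objective, while the reverse variant ends on NTP alone. That is consistent with the abstract's observation that only the forward curriculum retains the self-speculative decoding benefits.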
Related papers
- Next Concept Prediction in Discrete Latent Space Leads to Stronger Language Models [62.054835560934066]
Next Concept Prediction is a generative pretraining paradigm built on top of Next Token Prediction. Our model, ConceptLM, quantizes hidden states using Vector Quantization and constructs a concept vocabulary. Results on 13 benchmarks show that NCP yields consistent performance gains over traditional token-level models.
arXiv Detail & Related papers (2026-02-09T18:33:31Z) - Beyond Multi-Token Prediction: Pretraining LLMs with Future Summaries [35.39150917025755]
Future summary prediction (FSP) trains an auxiliary head to predict a compact representation of the long-term future. FSP provides improvements over both NTP and MTP across math, reasoning, and coding benchmarks.
arXiv Detail & Related papers (2025-10-16T14:52:52Z) - MTP-S2UT: Enhancing Speech-to-Speech Translation Quality with Multi-token Prediction [49.92201266421949]
We introduce multi-token prediction (MTP) loss into speech-to-unit translation (S2UT) models. We show that all MTP loss variants consistently improve the quality of S2UT translation.
arXiv Detail & Related papers (2025-10-11T04:06:20Z) - FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction [11.691960175716163]
This paper introduces FastMTP, a method that improves multi-step draft quality by aligning MTP training with its inference pattern. Our approach fine-tunes a single MTP head with position-shared weights on self-distilled data, enabling it to capture dependencies among consecutive future tokens. Experimental results across seven diverse benchmarks demonstrate that FastMTP achieves an average of 2.03x speedup compared to standard next token prediction.
arXiv Detail & Related papers (2025-09-16T07:36:26Z) - Predicting the Order of Upcoming Tokens Improves Language Modeling [15.048237391054611]
Multi-Token Prediction (MTP) has been proposed as an auxiliary objective to improve next-token prediction (NTP) in language model training. We argue that MTP's exact future token prediction is too difficult as an auxiliary loss. We propose Token Order Prediction (TOP), which trains models to order upcoming tokens by their proximity using a learning-to-rank loss.
arXiv Detail & Related papers (2025-08-26T17:43:30Z) - Cautious Next Token Prediction [62.74127603725369]
We propose a new training-free decoding strategy, dubbed Cautious Next Token Prediction (CNTP). In the decoding process, if the model has comparatively high prediction entropy at a certain step, we sample multiple trials starting from the step independently and stop when encountering any punctuation. We show that our proposed CNTP approach outperforms existing standard decoding strategies consistently by a clear margin.
arXiv Detail & Related papers (2025-07-03T05:49:18Z) - L-MTP: Leap Multi-Token Prediction Beyond Adjacent Context for Large Language Models [69.1271366892683]
We propose leap multi-token prediction (L-MTP), an innovative token prediction method. Unlike conventional MTP, L-MTP strategically skips over intermediate tokens, predicting non-sequential ones in a single forward pass. We theoretically demonstrate the benefit of L-MTP in improving inference efficiency.
arXiv Detail & Related papers (2025-05-23T05:59:46Z) - Efficient Joint Prediction of Multiple Future Tokens [20.647830092055955]
We introduce joint multi-token prediction (JTP), a lightweight modification of standard next-token prediction. Unlike previous multi-token prediction approaches, JTP strategically employs teacher forcing of future tokens. We show that the JTP approach achieves a short-horizon belief state representation, while popular alternatives for multi-token prediction fail to do so.
arXiv Detail & Related papers (2025-03-24T19:52:42Z) - On multi-token prediction for efficient LLM inference [0.36681882674260474]
We first show that such models inherently possess MTP capabilities via numerical marginalization over intermediate token probabilities. We then explore the challenges of integrating MTP heads into frozen LLMs and find that their hidden layers are strongly specialized for NTP.
arXiv Detail & Related papers (2025-02-13T15:42:44Z) - Reasoning Bias of Next Token Prediction Training [5.188841610098436]
Next token prediction (NTP) is the dominant training paradigm for Large Language Models (LLMs). We show that despite NTP's exposure to noise during training, it surpasses in reasoning ability. We attribute this counterintuitive outcome to the regularizing influence of noise on the training dynamics.
arXiv Detail & Related papers (2025-02-04T04:46:41Z) - NDP: Next Distribution Prediction as a More Broad Target [59.30497395313209]
We introduce Next Distribution Prediction (NDP), which uses $n$-gram distributions to replace the one-hot targets.
NDP can achieve up to +2.97 COMET improvement in translation tasks, +0.61 average improvement in general tasks, and an incredible +10.75 average improvement in the medical domain.
arXiv Detail & Related papers (2024-08-30T16:13:49Z) - Making Pre-trained Language Models both Task-solvers and Self-calibrators [52.98858650625623]
Pre-trained language models (PLMs) serve as backbones for various real-world systems.
Previous work shows that introducing an extra calibration task can mitigate this issue.
We propose a training algorithm LM-TOAST to tackle the challenges.
arXiv Detail & Related papers (2023-07-21T02:51:41Z) - Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks.
We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z) - Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models [107.05966685291067]
We propose test-time prompt tuning (TPT) to learn adaptive prompts on the fly with a single test sample.
TPT improves the zero-shot top-1 accuracy of CLIP by 3.6% on average.
In evaluating cross-dataset generalization with unseen categories, TPT performs on par with the state-of-the-art approaches that use additional training data.
arXiv Detail & Related papers (2022-09-15T17:55:11Z) - MTI-Net: A Multi-Target Speech Intelligibility Prediction Model [25.124218779681875]
This study proposes a multi-task speech intelligibility prediction model, called MTI-Net, for simultaneously predicting human and machine intelligibility measures.
Specifically, given a speech utterance, MTI-Net is designed to predict subjective listening test results and word error rate (WER) scores.
arXiv Detail & Related papers (2022-04-07T09:17:04Z) - Pre-training Text Representations as Meta Learning [113.3361289756749]
We introduce a learning algorithm which directly optimizes the model's ability to learn text representations for effective learning of downstream tasks.
We show that there is an intrinsic connection between multi-task pre-training and model-agnostic meta-learning with a sequence of meta-train steps.
arXiv Detail & Related papers (2020-04-12T09:05:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.