Predicting the Order of Upcoming Tokens Improves Language Modeling
- URL: http://arxiv.org/abs/2508.19228v1
- Date: Tue, 26 Aug 2025 17:43:30 GMT
- Title: Predicting the Order of Upcoming Tokens Improves Language Modeling
- Authors: Zayd M. K. Zuhri, Erland Hilman Fuadi, Alham Fikri Aji
- Abstract summary: Multi-Token Prediction (MTP) has been proposed as an auxiliary objective to improve next-token prediction (NTP) in language model training. We argue that MTP's exact future token prediction is too difficult as an auxiliary loss. We propose Token Order Prediction (TOP), which trains models to order upcoming tokens by their proximity using a learning-to-rank loss.
- Score: 15.048237391054611
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Multi-Token Prediction (MTP) has been proposed as an auxiliary objective to improve next-token prediction (NTP) in language model training but shows inconsistent improvements, underperforming in standard NLP benchmarks. We argue that MTP's exact future token prediction is too difficult as an auxiliary loss. Instead, we propose Token Order Prediction (TOP), which trains models to order upcoming tokens by their proximity using a learning-to-rank loss. TOP requires only a single additional unembedding layer compared to MTP's multiple transformer layers. We pretrain models of 340M, 1.8B, and 7B parameters using NTP, MTP, and TOP objectives. Results on eight standard NLP benchmarks show that TOP overall outperforms both NTP and MTP even at scale. Our code is available at https://github.com/zaydzuhri/token-order-prediction
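The idea of ordering upcoming tokens by proximity can be illustrated with a small sketch. It is not the authors' implementation (see the linked repository for that). The scoring rule (window - distance + 1), the choice of a ListNet-style listwise loss, and the names `top_targets`, `listnet_loss`, and `window` are all assumptions made for illustration; the abstract only says that a learning-to-rank loss is used.

```python
import math


def top_targets(tokens, vocab_size, window):
    """Build one proximity-score vector per position (assumed scoring rule).

    A vocabulary id whose nearest upcoming occurrence is d steps ahead
    (1 <= d <= window) gets score window - d + 1, so closer tokens rank
    higher. Ids that do not appear inside the window get -inf.
    """
    targets = []
    for t in range(len(tokens)):
        scores = [-math.inf] * vocab_size
        # Walk from the far end of the window inward so the nearest
        # occurrence of a repeated token overwrites farther ones.
        for d in range(window, 0, -1):
            if t + d < len(tokens):
                scores[tokens[t + d]] = window - d + 1
        targets.append(scores)
    return targets


def listnet_loss(logits, scores):
    """ListNet-style loss: cross-entropy between softmax(scores) and softmax(logits).

    Returns 0.0 when no token falls inside the window (all scores are -inf).
    """
    top = max(scores)
    if top == -math.inf:
        return 0.0
    # Target distribution; exp(-inf) evaluates to 0.0 for out-of-window ids.
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    target = [e / total for e in exps]
    # Numerically stable log-sum-exp of the model's ranking logits.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(p * (x - log_z) for p, x in zip(target, logits) if p > 0)


# Toy sequence over a 3-token vocabulary, looking 2 steps ahead.
targets = top_targets([2, 0, 1, 0], vocab_size=3, window=2)
good = listnet_loss([2.0, 1.0, -5.0], targets[0])   # logits agree with the order
bad = listnet_loss([-5.0, 1.0, 2.0], targets[0])    # logits reverse the order
```

In this sketch the logits would come from the single additional unembedding layer the abstract mentions, and the ranking loss would be added to the usual next-token loss. Logits that agree with the proximity order (`good`) yield a lower loss than logits that reverse it (`bad`).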
Related papers
- Next Concept Prediction in Discrete Latent Space Leads to Stronger Language Models [62.054835560934066]
Next Concept Prediction is a generative pretraining paradigm built on top of Next Token Prediction. Our model, ConceptLM, quantizes hidden states using Vector Quantization and constructs a concept vocabulary. Results on 13 benchmarks show that NCP yields consistent performance gains over traditional token-level models.
Date: 2026-02-09T18:33:31Z
- MTP-S2UT: Enhancing Speech-to-Speech Translation Quality with Multi-token Prediction [49.92201266421949]
We introduce multi-token prediction (MTP) loss into speech-to-unit translation (S2UT) models. We show that all MTP loss variants consistently improve the quality of S2UT translation.
Date: 2025-10-11T04:06:20Z
- Cautious Next Token Prediction [62.74127603725369]
We propose a new training-free decoding strategy, dubbed Cautious Next Token Prediction (CNTP). In the decoding process, if the model has comparatively high prediction entropy at a certain step, we sample multiple trials starting from the step independently and stop when encountering any punctuation. We show that our proposed CNTP approach outperforms existing standard decoding strategies consistently by a clear margin.
Date: 2025-07-03T05:49:18Z
- Pre-Training Curriculum for Multi-Token Prediction in Language Models [2.8071268036220003]
Multi-token prediction (MTP) is a recently proposed pre-training objective for language models. We propose a curriculum learning strategy for MTP training, exploring two variants: a forward curriculum and a reverse curriculum.
Date: 2025-05-28T18:19:18Z
- Fast Quiet-STaR: Thinking Without Thought Tokens [51.79231070632772]
Fast Quiet-STaR is a more efficient reasoning framework that preserves the benefits of token-level reasoning while reducing computational cost. Our method introduces a curriculum-learning-based training strategy that gradually reduces the number of thought tokens. Experiments on four benchmark datasets with Mistral 7B and Qwen2.5 7B demonstrate that Fast Quiet-STaR consistently outperforms Quiet-STaR in terms of average accuracy.
Date: 2025-05-23T11:14:12Z
- L-MTP: Leap Multi-Token Prediction Beyond Adjacent Context for Large Language Models [69.1271366892683]
We propose leap multi-token prediction (L-MTP), an innovative token prediction method. Unlike conventional MTP, L-MTP strategically skips over intermediate tokens, predicting non-sequential ones in a single forward pass. We theoretically demonstrate the benefit of L-MTP in improving inference efficiency.
Date: 2025-05-23T05:59:46Z
- Efficient Joint Prediction of Multiple Future Tokens [20.647830092055955]
We introduce joint multi-token prediction (JTP), a lightweight modification of standard next-token prediction. Unlike previous multi-token prediction approaches, JTP strategically employs teacher forcing of future tokens. We show that the JTP approach achieves a short-horizon belief state representation, while popular alternatives for multi-token prediction fail to do so.
Date: 2025-03-24T19:52:42Z
- Improving Next Tokens via Second-to-Last Predictions with Generate and Refine [1.8592384822257952]
We train a decoder-only architecture for predicting the second-to-last token for a sequence of tokens. Our approach yields higher computational training efficiency than BERT-style models.
Date: 2024-11-23T22:09:58Z
- Model-tuning Via Prompts Makes NLP Models Adversarially Robust [97.02353907677703]
We show surprising gains in adversarial robustness enjoyed by Model-tuning Via Prompts (MVP). MVP improves performance against adversarial substitutions by an average of 8% over standard methods. We also conduct ablations to investigate the mechanism underlying these gains.
Date: 2023-03-13T17:41:57Z
- Fast-iTPN: Integrally Pre-Trained Transformer Pyramid Network with Token Migration [138.24994198567794]
iTPN is born with two elaborated designs: 1) The first pre-trained feature pyramid upon vision transformer (ViT). Fast-iTPN can accelerate the inference procedure by up to 70%, with negligible performance loss.
Date: 2022-11-23T06:56:12Z