Provable Long-Range Benefits of Next-Token Prediction
- URL: http://arxiv.org/abs/2512.07818v1
- Date: Mon, 08 Dec 2025 18:51:54 GMT
- Title: Provable Long-Range Benefits of Next-Token Prediction
- Authors: Xinyuan Cao, Santosh S. Vempala
- Abstract summary: We show that next-token prediction is provably powerful for learning longer-range structure. We provide an explanation for the long-range coherence observed in practice.
- Score: 11.043470114967775
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Why do modern language models, trained to do well on next-word prediction, appear to generate coherent documents and capture long-range structure? Here we show that next-token prediction is provably powerful for learning longer-range structure, even with common neural network architectures. Specifically, we prove that optimizing next-token prediction over a Recurrent Neural Network (RNN) yields a model that closely approximates the training distribution: for held-out documents sampled from the training distribution, no algorithm of bounded description length limited to examining the next $k$ tokens, for any $k$, can distinguish between $k$ consecutive tokens of such documents and $k$ tokens generated by the learned language model following the same prefix. We provide polynomial bounds (in $k$, independent of the document length) on the model size needed to achieve such $k$-token indistinguishability, offering a complexity-theoretic explanation for the long-range coherence observed in practice.
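As a concrete illustration of the setup the abstract describes, the sketch below trains a recurrent model with the standard next-token objective and then samples a $k$-token continuation of a held-out prefix, which is the quantity the indistinguishability result compares against real text. This is a minimal PyTorch illustration under assumed names and sizes (NextTokenRNN, the GRU cell, VOCAB, HIDDEN, K), not the paper's construction or its quantitative bounds.

```python
# Minimal sketch of the setting (assumed names and sizes; not the paper's
# construction): a recurrent model trained with the next-token objective,
# plus sampling of a k-token continuation from a held-out prefix.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HIDDEN, K = 64, 128, 16  # illustrative sizes, not the paper's bounds

class NextTokenRNN(nn.Module):
    def __init__(self, vocab=VOCAB, hidden=HIDDEN):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab)

    def forward(self, tokens, state=None):
        out, state = self.rnn(self.embed(tokens), state)
        return self.head(out), state  # next-token logits at every position

model = NextTokenRNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(batch):
    """One next-token-prediction step; batch is (B, T) token ids."""
    logits, _ = model(batch[:, :-1])
    loss = F.cross_entropy(logits.reshape(-1, VOCAB), batch[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

@torch.no_grad()
def continue_prefix(prefix, k=K):
    """Sample k tokens following `prefix` (shape (1, T)) from the learned model."""
    logits, state = model(prefix)
    tok = torch.multinomial(logits[:, -1].softmax(-1), 1)  # (1, 1)
    out = []
    for _ in range(k):
        out.append(tok)
        logits, state = model(tok, state)
        tok = torch.multinomial(logits[:, -1].softmax(-1), 1)
    return torch.cat(out, dim=1)  # (1, k): compared against the real next k tokens
```

In the paper's terms, the claim is that for a sufficiently large trained model (polynomial in $k$), no bounded-description-length tester that inspects only $k$ consecutive tokens can distinguish the output of such a sampled continuation from the true continuation of the same prefix.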
Related papers
- Multi-Token Prediction via Self-Distillation [73.81494481537636]
We consider a new approach for converting a pretrained autoregressive language model from a slow single next-token prediction model into a fast standalone multi-token prediction model. On GSM8K, our method produces models that can decode more than $3\times$ faster on average, with a $5\%$ drop in accuracy relative to single-token decoding performance.
arXiv Detail & Related papers (2026-02-05T18:54:48Z) - Speculative Decoding Speed-of-Light: Optimal Lower Bounds via Branching Random Walks [39.54576236079211]
Speculative generation has emerged as a promising technique to accelerate inference in large language models. In this work, we establish the first "tight" lower bounds on the runtime of any deterministic speculative generation algorithm.
arXiv Detail & Related papers (2025-12-12T16:54:33Z) - Continuous Autoregressive Language Models [56.49239051750678]
We introduce Continuous Autoregressive Language Models (CALM). CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector. We develop a comprehensive likelihood-free framework that enables robust training, evaluation, and controllable sampling.
arXiv Detail & Related papers (2025-10-31T17:58:11Z) - Context-level Language Modeling by Learning Predictive Context Embeddings [79.00607069677393]
We introduce ContextLM, a framework that augments standard pretraining with an inherent next-context prediction objective. This mechanism trains the model to learn predictive representations of multi-token contexts, leveraging error signals derived from future token chunks. Experiments on the GPT2 and Pythia model families, scaled up to $1.5$B parameters, show that ContextLM delivers consistent improvements in both perplexity and downstream task performance.
arXiv Detail & Related papers (2025-10-23T07:09:45Z) - How Reinforcement Learning After Next-Token Prediction Facilitates Learning [36.98696363889831]
We study learning from mixture distributions of short and long "chain-of-thought" sequences encoding a single task. We show how reinforcement learning after next-token prediction enables autoregressive transformers to generalize.
arXiv Detail & Related papers (2025-10-13T15:04:00Z) - Token Weighting for Long-Range Language Modeling [50.2371550397256]
We propose novel token-weighting schemes that assign different weights to each training token in the loss. We evaluate all methods on multiple long-context understanding tasks and show that non-uniform loss weights are helpful. This work contributes to a better understanding of the trade-offs that long-context language modeling faces.
arXiv Detail & Related papers (2025-03-12T09:46:59Z) - FIRP: Faster LLM inference via future intermediate representation prediction [54.897493351694195]
FIRP generates multiple tokens instead of one at each decoding step.
We conduct extensive experiments, showing a speedup ratio of 1.9x-3x across several models and datasets.
arXiv Detail & Related papers (2024-10-27T15:53:49Z) - Faster Language Models with Better Multi-Token Prediction Using Tensor Decomposition [5.575078692353885]
We propose a new model for multi-token prediction in transformers, aiming to enhance sampling efficiency without compromising accuracy. By generalizing it to a rank-$r$ canonical probability decomposition, we develop an improved model that predicts multiple tokens simultaneously.
arXiv Detail & Related papers (2024-10-23T11:06:36Z) - Efficient Training of Language Models with Compact and Consistent Next Token Distributions [23.312920633391837]
We show that we can train better models faster by pre-aggregating the corpus with a collapsed $n$-gram distribution.
Our approximation allows these gains to scale to larger datasets and models.
arXiv Detail & Related papers (2024-07-03T05:40:41Z) - TokenUnify: Scaling Up Autoregressive Pretraining for Neuron Segmentation [65.65530016765615]
We propose a hierarchical predictive coding framework that captures multi-scale dependencies through three complementary learning objectives. TokenUnify integrates random token prediction, next-token prediction, and next-all token prediction to create a comprehensive representational space. We also introduce a large-scale EM dataset with 1.2 billion annotated voxels, offering ideal long-sequence visual data with spatial continuity.
arXiv Detail & Related papers (2024-05-27T05:45:51Z) - Auto-Regressive Next-Token Predictors are Universal Learners [17.416520406390415]
We show that even simple models such as linear next-token predictors can approximate any function efficiently computed by a Turing machine.
We also show experimentally that simple next-token predictors, such as linear networks and shallow Multi-Layer Perceptrons (MLPs), display non-trivial performance on text generation and arithmetic tasks.
arXiv Detail & Related papers (2023-09-13T14:15:03Z) - Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once).
The model consists of an encoder, a decoder, and a position-dependent summarizer (PDS).
arXiv Detail & Related papers (2021-02-15T15:18:59Z)