Provable Long-Range Benefits of Next-Token Prediction
- URL: http://arxiv.org/abs/2512.07818v1
- Date: Mon, 08 Dec 2025 18:51:54 GMT
- Title: Provable Long-Range Benefits of Next-Token Prediction
- Authors: Xinyuan Cao, Santosh S. Vempala
- Abstract summary: We show that next-token prediction is provably powerful for learning longer-range structure. We provide an explanation for the long-range coherence observed in practice.
- Score: 11.043470114967775
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Why do modern language models, trained to do well on next-word prediction, appear to generate coherent documents and capture long-range structure? Here we show that next-token prediction is provably powerful for learning longer-range structure, even with common neural network architectures. Specifically, we prove that optimizing next-token prediction over a Recurrent Neural Network (RNN) yields a model that closely approximates the training distribution: for held-out documents sampled from the training distribution, no algorithm of bounded description length limited to examining the next $k$ tokens, for any $k$, can distinguish between $k$ consecutive tokens of such documents and $k$ tokens generated by the learned language model following the same prefix. We provide polynomial bounds (in $k$, independent of the document length) on the model size needed to achieve such $k$-token indistinguishability, offering a complexity-theoretic explanation for the long-range coherence observed in practice.
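As a concrete illustration of the setup the abstract describes, the sketch below trains a recurrent model with the standard next-token objective and then samples a $k$-token continuation of a held-out prefix, which is the quantity the indistinguishability result compares against real text. This is a minimal PyTorch illustration under assumed names and sizes (NextTokenRNN, the GRU cell, VOCAB, HIDDEN, K), not the paper's construction or its quantitative bounds.

```python
# Minimal sketch of the setting (assumed names and sizes; not the paper's
# construction): a recurrent model trained with the next-token objective,
# plus sampling of a k-token continuation from a held-out prefix.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HIDDEN, K = 64, 128, 16  # illustrative sizes, not the paper's bounds

class NextTokenRNN(nn.Module):
    def __init__(self, vocab=VOCAB, hidden=HIDDEN):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab)

    def forward(self, tokens, state=None):
        out, state = self.rnn(self.embed(tokens), state)
        return self.head(out), state  # next-token logits at every position

model = NextTokenRNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(batch):
    """One next-token-prediction step; batch is (B, T) token ids."""
    logits, _ = model(batch[:, :-1])
    loss = F.cross_entropy(logits.reshape(-1, VOCAB), batch[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

@torch.no_grad()
def continue_prefix(prefix, k=K):
    """Sample k tokens following `prefix` (shape (1, T)) from the learned model."""
    logits, state = model(prefix)
    tok = torch.multinomial(logits[:, -1].softmax(-1), 1)  # (1, 1)
    out = []
    for _ in range(k):
        out.append(tok)
        logits, state = model(tok, state)
        tok = torch.multinomial(logits[:, -1].softmax(-1), 1)
    return torch.cat(out, dim=1)  # (1, k): compared against the real next k tokens
```

In the paper's terms, the claim is that for a sufficiently large trained model (polynomial in $k$), no bounded-description-length tester that inspects only $k$ consecutive tokens can distinguish the output of such a sampled continuation from the true continuation of the same prefix.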
Related papers
- Multi-Token Prediction via Self-Distillation [73.81494481537636]
We consider a new approach for converting a pretrained autoregressive language model from a slow single next-token prediction model into a fast standalone multi-token prediction model. On GSM8K, our method produces models that can decode more than $3\times$ faster on average, with a $5\%$ drop in accuracy relative to single-token decoding performance.
arXiv Detail & Related papers (2026-02-05T18:54:48Z) - Speculative Decoding Speed-of-Light: Optimal Lower Bounds via Branching Random Walks [39.54576236079211]
Speculative generation has emerged as a promising technique to accelerate inference in large language models. In this work, we establish the first "tight" lower bounds on the runtime of any deterministic speculative generation algorithm.
arXiv Detail & Related papers (2025-12-12T16:54:33Z) - Continuous Autoregressive Language Models [56.49239051750678]
We introduce Continuous Autoregressive Language Models (CALM). CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector. We develop a comprehensive likelihood-free framework that enables robust training, evaluation, and controllable sampling.
arXiv Detail & Related papers (2025-10-31T17:58:11Z) - Context-level Language Modeling by Learning Predictive Context Embeddings [79.00607069677393]
We introduce ContextLM, a framework that augments standard pretraining with an inherent next-context prediction objective. This mechanism trains the model to learn predictive representations of multi-token contexts, leveraging error signals derived from future token chunks. Experiments on the GPT2 and Pythia model families, scaled up to $1.5$B parameters, show that ContextLM delivers consistent improvements in both perplexity and downstream task performance.
arXiv Detail & Related papers (2025-10-23T07:09:45Z) - How Reinforcement Learning After Next-Token Prediction Facilitates Learning [36.98696363889831]
We study learning from mixture distributions of short and long "chain-of-thought" sequences encoding a single task. We show how reinforcement learning after next-token prediction enables autoregressive transformers to generalize.
arXiv Detail & Related papers (2025-10-13T15:04:00Z) - Token Weighting for Long-Range Language Modeling [50.2371550397256]
We propose novel token-weighting schemes that assign different weights to each training token in the loss. We evaluate all methods on multiple long-context understanding tasks and show that non-uniform loss weights are helpful. This work contributes to a better understanding of the trade-offs that long-context language modeling faces.
arXiv Detail & Related papers (2025-03-12T09:46:59Z) - FIRP: Faster LLM inference via future intermediate representation prediction [54.897493351694195]
FIRP generates multiple tokens instead of one at each decoding step.
We conduct extensive experiments, showing a speedup ratio of 1.9x-3x across several models and datasets.
arXiv Detail & Related papers (2024-10-27T15:53:49Z) - Faster Language Models with Better Multi-Token Prediction Using Tensor Decomposition [5.575078692353885]
We propose a new model for multi-token prediction in transformers, aiming to enhance sampling efficiency without compromising accuracy. By generalizing it to a rank-$r$ canonical probability decomposition, we develop an improved model that predicts multiple tokens simultaneously.
arXiv Detail & Related papers (2024-10-23T11:06:36Z) - Efficient Training of Language Models with Compact and Consistent Next Token Distributions [23.312920633391837]
We show that we can train better models faster by pre-aggregating the corpus with a collapsed $n$-gram distribution.
Our approximation allows these gains to scale to larger datasets and models.
arXiv Detail & Related papers (2024-07-03T05:40:41Z) - TokenUnify: Scaling Up Autoregressive Pretraining for Neuron Segmentation [65.65530016765615]
We propose a hierarchical predictive coding framework that captures multi-scale dependencies through three complementary learning objectives. TokenUnify integrates random token prediction, next-token prediction, and next-all token prediction to create a comprehensive representational space. We also introduce a large-scale EM dataset with 1.2 billion annotated voxels, offering ideal long-sequence visual data with spatial continuity.
arXiv Detail & Related papers (2024-05-27T05:45:51Z) - Auto-Regressive Next-Token Predictors are Universal Learners [17.416520406390415]
We show that even simple models such as linear next-token predictors can approximate any function efficiently computed by a Turing machine.
We also show experimentally that simple next-token predictors, such as linear networks and shallow Multi-Layer Perceptrons (MLPs), display non-trivial performance on text generation and arithmetic tasks.
arXiv Detail & Related papers (2023-09-13T14:15:03Z) - Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once).
The model consists of an encoder, a decoder, and a position-dependent summarizer (PDS).
arXiv Detail & Related papers (2021-02-15T15:18:59Z)