Next-Latent Prediction Transformers Learn Compact World Models
- URL: http://arxiv.org/abs/2511.05963v1
- Date: Sat, 08 Nov 2025 10:41:26 GMT
- Title: Next-Latent Prediction Transformers Learn Compact World Models
- Authors: Jayden Teoh, Manan Tomar, Kwangjun Ahn, Edward S. Hu, Pratyusha Sharma, Riashat Islam, Alex Lamb, John Langford
- Abstract summary: Next-Latent Prediction extends standard next-token training with self-supervised predictions in the latent space. NextLat demonstrates significant gains over standard next-token training in downstream accuracy, representation compression, and lookahead planning.
- Score: 33.499164089236444
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers replace recurrence with a memory that grows with sequence length and self-attention that enables ad hoc lookups over past tokens. Consequently, they lack an inherent incentive to compress history into compact latent states with consistent transition rules. This often leads to learning solutions that generalize poorly. We introduce Next-Latent Prediction (NextLat), which extends standard next-token training with self-supervised predictions in the latent space. Specifically, NextLat trains a transformer to learn latent representations that are predictive of its next latent state given the next output token. Theoretically, we show that these latents provably converge to belief states, compressed information of the history necessary to predict the future. This simple auxiliary objective also injects a recurrent inductive bias into transformers, while leaving their architecture, parallel training, and inference unchanged. NextLat effectively encourages the transformer to form compact internal world models with its own belief states and transition dynamics -- a crucial property absent in standard next-token prediction transformers. Empirically, across benchmarks targeting core sequence modeling competencies -- world modeling, reasoning, planning, and language modeling -- NextLat demonstrates significant gains over standard next-token training in downstream accuracy, representation compression, and lookahead planning. NextLat stands as a simple and efficient paradigm for shaping transformer representations toward stronger generalization.
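The abstract describes the objective at a high level: keep the standard next-token cross-entropy, and add an auxiliary head that predicts the next latent state from the current latent and the next token. The sketch below is a minimal illustration of that idea in PyTorch, not the authors' implementation; the module names, the stop-gradient on the target latent, and the loss weighting are all assumptions made for the example.

```python
# Illustrative sketch of a NextLat-style auxiliary objective (NOT the
# paper's code). A causal transformer is trained with the usual
# next-token cross-entropy plus a latent-prediction loss: a small head
# maps (h_t, embedding of x_{t+1}) to the next latent h_{t+1}.
# All names and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyNextLat(nn.Module):
    def __init__(self, vocab=32, d=64, n_layers=2, n_heads=4, max_len=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.pos = nn.Embedding(max_len, d)
        layer = nn.TransformerEncoderLayer(d, n_heads, 4 * d, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d, vocab)
        # Latent-transition head: predicts h_{t+1} from [h_t; emb(x_{t+1})].
        self.next_latent = nn.Sequential(
            nn.Linear(2 * d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        T = x.size(1)
        h = self.embed(x) + self.pos(torch.arange(T, device=x.device))
        # Causal mask so h_t depends only on tokens x_1..x_t.
        mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.encoder(h, mask=mask)
        return h, self.lm_head(h)


def nextlat_loss(model, x, aux_weight=0.5):
    h, logits = model(x)
    # Standard next-token cross-entropy.
    ce = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)), x[:, 1:].reshape(-1))
    # Auxiliary loss: predict the next latent from the current latent and
    # the next token; stop-gradient on the target to avoid collapse
    # (a common choice in self-supervised latent prediction, assumed here).
    tok_next = model.embed(x[:, 1:])
    pred = model.next_latent(torch.cat([h[:, :-1], tok_next], dim=-1))
    aux = F.mse_loss(pred, h[:, 1:].detach())
    return ce + aux_weight * aux


model = TinyNextLat()
x = torch.randint(0, 32, (2, 16))  # toy batch of token ids
loss = nextlat_loss(model, x)
loss.backward()  # both objectives train the shared transformer latents
```

Note that the auxiliary head and loss leave the architecture, parallel training, and inference path of the base transformer unchanged, matching the paper's stated design constraint; only the training loss gains an extra term.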
Related papers
- Incremental Learning of Sparse Attention Patterns in Transformers [29.54151079577767]
This paper introduces a high-order Markov chain task to investigate how transformers learn to integrate information from multiple past positions. We identify a shift in learning dynamics from competitive, where heads converge on the most statistically dominant pattern, to cooperative, where heads specialize in distinct patterns.
arXiv Detail & Related papers (2026-02-22T12:16:06Z) - HT-Transformer: Event Sequences Classification by Accumulating Prefix Information with History Tokens [1.534667887016089]
We introduce history tokens, a novel concept that facilitates the accumulation of historical information during prediction pretraining. Our approach significantly improves transformer-based models, achieving impressive results in finance, e-commerce, and healthcare tasks.
arXiv Detail & Related papers (2025-08-02T19:50:58Z) - Bottlenecked Transformers: Periodic KV Cache Consolidation for Generalised Reasoning [16.35681450323654]
Transformer LLMs have been shown to exhibit strong reasoning ability that scales with inference-time compute. We give a theoretical justification as to why memory (re)consolidation via KV cache rewrites is beneficial for improved reasoning. Our model sees consistent performance gains over vanilla Transformers and pause-token augmented baselines, with gains of up to +6.6pp for selected tasks/backbones.
arXiv Detail & Related papers (2025-05-22T17:33:49Z) - Moving Beyond Next-Token Prediction: Transformers are Context-Sensitive Language Generators [0.40792653193642503]
Large Language Models (LLMs) powered by Transformers have demonstrated human-like intelligence capabilities. This paper presents a novel framework for interpreting LLMs as probabilistic left context-sensitive language (CSL) generators.
arXiv Detail & Related papers (2025-04-15T04:06:27Z) - The Role of Sparsity for Length Generalization in Transformers [58.65997625433689]
We propose a new theoretical framework to study length generalization for the next-token prediction task. We show that length generalization occurs as long as each predicted token depends on a small (fixed) number of previous tokens. We introduce Predictive Position Coupling, which trains the transformer to predict the position IDs used in a positional coupling approach.
arXiv Detail & Related papers (2025-02-24T03:01:03Z) - Curse of Attention: A Kernel-Based Perspective for Why Transformers Fail to Generalize on Time Series Forecasting and Beyond [17.002793355495136]
We propose the first theoretical explanation of the inefficiency of transformers on TSF tasks. We attribute the mechanism behind it to Asymmetric Learning in training attention networks.
arXiv Detail & Related papers (2024-12-08T20:29:06Z) - The Belief State Transformer [51.840276930729516]
The "Belief State Transformer" is a next-token predictor that takes both a prefix and suffix as inputs. It effectively learns to solve challenging problems that conventional forward-only transformers struggle with. Empirical ablations show that each component of the model is essential in difficult scenarios where standard Transformers fall short.
arXiv Detail & Related papers (2024-10-30T23:26:06Z) - Local to Global: Learning Dynamics and Effect of Initialization for Transformers [20.02103237675619]
We focus on first-order Markov chains and single-layer transformers.
We prove that transformer parameters trained on next-token prediction loss can either converge to global or local minima.
arXiv Detail & Related papers (2024-06-05T08:57:41Z) - How do Transformers perform In-Context Autoregressive Learning? [76.18489638049545]
We train a Transformer model on a simple next token prediction task.
We show how a trained Transformer predicts the next token by first learning $W$ in-context, then applying a prediction mapping.
arXiv Detail & Related papers (2024-02-08T16:24:44Z) - Emergent Agentic Transformer from Chain of Hindsight Experience [96.56164427726203]
We show, for the first time, that a simple transformer-based model performs competitively with both temporal-difference and imitation-learning-based approaches.
arXiv Detail & Related papers (2023-05-26T00:43:02Z) - What Makes for Good Tokenizers in Vision Transformer? [62.44987486771936]
Transformers are capable of extracting pairwise relationships among tokens using self-attention. However, what makes for a good tokenizer has not been well understood in computer vision. Modulation across Tokens (MoTo) incorporates inter-token modeling capability through normalization. A regularization objective, TokenProp, is adopted in the standard training regime.
arXiv Detail & Related papers (2022-12-21T15:51:43Z) - Addressing Some Limitations of Transformers with Feedback Memory [51.94640029417114]
Transformers have been successfully applied to sequential, auto-regressive tasks despite being feedforward networks.
We propose the Feedback Transformer architecture that exposes all previous representations to all future representations.
We demonstrate on a variety of benchmarks in language modeling, machine translation, and reinforcement learning that the increased representation capacity can create small, shallow models with much stronger performance than comparable Transformers.
arXiv Detail & Related papers (2020-02-21T16:37:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.