Thoughtbubbles: an Unsupervised Method for Parallel Thinking in Latent Space
- URL: http://arxiv.org/abs/2510.00219v1
- Date: Tue, 30 Sep 2025 19:49:15 GMT
- Title: Thoughtbubbles: an Unsupervised Method for Parallel Thinking in Latent Space
- Authors: Houjun Liu, Shikhar Murty, Christopher D. Manning, Róbert Csordás
- Abstract summary: Current approaches for scaling inference-time compute in transformers rely on training them to emit explicit chain-of-thought tokens before producing an answer. Thoughtbubbles is a transformer variant that performs parallel adaptive computation in latent space by learning to fork or delete residual streams. Thoughtbubbles outperforms both standard decoder LMs and non-adaptive parallel computation approaches on OpenWebText and peS2o perplexity and in zero-shot evaluations such as HellaSwag and LAMBADA.
- Score: 38.50132130644233
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current approaches for scaling inference-time compute in transformers rely on training them to emit explicit chain-of-thought tokens before producing an answer. While these methods are powerful, they are limited because they cannot be applied during pretraining and are limited to only serially-generated, natural-language verbalization to scale inference-time compute. In this work, we propose Thoughtbubbles, a transformer variant that natively performs parallel adaptive computation in latent space by learning to fork or delete residual streams. Thus, tokens that require a large amount of computation can form a "bubble" of cloned residuals in the middle of the network for additional thinking. Crucially, this behavior is learned during pretraining with only language modeling loss. Thoughtbubbles outperforms both standard decoder LMs as well as non-adaptive parallel computation approaches on OpenWebText and peS2o perplexity and in zero-shot evaluations such as HellaSwag and LAMBADA after pretraining across 150M to 772M parameter scales. The implicit nature of our method enables adaptive computation to be learned starting at pretraining time, paving the way to unify train and test-time behavior for reasoning models.
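The fork/delete mechanism the abstract describes can be illustrated with a toy sketch. Everything below is an illustrative assumption, not the authors' implementation: the scoring head, thresholds, and plain-list "residual streams" merely stand in for the learned components.

```python
# Toy sketch of the fork/delete idea: each token carries a residual stream;
# mid-network, a learned score decides whether to clone the stream (opening a
# "bubble" for extra thinking) or delete it. The scoring head and thresholds
# here are hypothetical stand-ins for the learned mechanism.

import math

def score(stream):
    # Hypothetical scoring head: mean activation squashed to (0, 1).
    m = sum(stream) / len(stream)
    return 1.0 / (1.0 + math.exp(-m))

def fork_or_delete(streams, fork_thresh=0.7, delete_thresh=0.3):
    """Return a new list of residual streams after adaptive fork/delete."""
    out = []
    for s in streams:
        p = score(s)
        if p > fork_thresh:       # high score: fork a clone for extra compute
            out.append(list(s))
            out.append(list(s))
        elif p < delete_thresh:   # low score: delete the stream entirely
            continue
        else:                     # otherwise keep the stream as-is
            out.append(list(s))
    return out

streams = [[2.0, 2.0], [0.0, 0.0], [-3.0, -3.0]]
result = fork_or_delete(streams)
# stream 1 (score ~0.88) forks, stream 2 (0.5) is kept, stream 3 (~0.05) is deleted
```

The key property, and what "unsupervised" refers to here, is that such fork/delete decisions are learned during pretraining from the language modeling loss alone, with no explicit supervision of where extra compute goes.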
Related papers
- AdaPonderLM: Gated Pondering Language Models with Token-Wise Adaptive Depth [23.442686851761298]
AdaPonderLM is a self-supervised recurrent language model that learns token-wise early exiting during pretraining. AdaPonderLM reduces inference compute by about 10% while maintaining comparable language modeling perplexity and competitive downstream accuracy.
arXiv Detail & Related papers (2026-03-02T14:28:16Z)
- Pretraining with Token-Level Adaptive Latent Chain-of-Thought [44.19871205975474]
Scaling large language models by increasing parameters and training data is increasingly constrained by limited high-quality corpora and rising communication costs. This work explores an alternative axis: increasing per-token computation without expanding parameters, by internalizing latent Chain-of-Thought (CoT) into pretraining. We propose Pretraining with Token-Level Adaptive Latent CoT (adaptive latent CoT), where the model generates a variable-length latent CoT trajectory before emitting each token. Experiments with Llama architectures show that adaptive latent CoT consistently improves language modeling perplexity and broad downstream accuracy, even with fewer training FLOPs.
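The per-token variable-length latent rollout described above can be sketched as a halting loop. The latent update, stopping criterion, and cap are all illustrative assumptions, not the paper's architecture:

```python
# Hedged sketch of token-level adaptive latent CoT: before emitting each token,
# the model rolls out a variable-length trajectory of latent states, halting
# when a stopping criterion fires or a step cap is reached. The contraction
# map and norm-based halting rule below are stand-ins for learned components.

def latent_step(h):
    # Hypothetical latent update: a contraction toward zero stands in
    # for one pass through a latent-reasoning block.
    return [0.5 * x for x in h]

def norm(h):
    return sum(x * x for x in h) ** 0.5

def adaptive_latent_cot(h, eps=0.1, max_steps=8):
    """Iterate latent steps until the state has 'settled' or the cap is hit."""
    steps = 0
    while norm(h) > eps and steps < max_steps:
        h = latent_step(h)
        steps += 1
    return h, steps

h, steps = adaptive_latent_cot([1.0, 1.0])
# the norm halves each step, so this input halts after 4 latent steps
```

The point of the cap is adaptivity: "easy" tokens halt after few latent steps while "hard" tokens spend more, so per-token compute varies without changing parameter count.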
arXiv Detail & Related papers (2026-02-09T02:49:15Z)
- Semantic Soft Bootstrapping: Long Context Reasoning in LLMs without Reinforcement Learning [46.765013720309064]
Long context reasoning in large language models (LLMs) has demonstrated enhancement of their cognitive capabilities via chain-of-thought (CoT) inference. Training such models is usually done via reinforcement learning with verifiable rewards (RLVR) on reasoning-based problems, like math and programming. We propose Semantic Soft Bootstrapping (SSB), a self-distillation technique in which the same base language model plays the role of both teacher and student, but receives different semantic contexts about the correctness of its outcome at training time.
arXiv Detail & Related papers (2025-12-04T18:59:18Z)
- Pretraining LLM with Latent Thoughts in Continuous Space [44.24277388571869]
We propose a novel pre-training methodology: Pretraining Language Models with Latent Thoughts. Our approach pretrains a language model (LM) to first generate an intermediate latent thought (the last hidden state of the current position). We show that, at an identical inference cost, an LM that generates one additional latent thought per token outperforms a standard model with double the parameters.
arXiv Detail & Related papers (2025-09-27T08:38:08Z)
- Blockwise SFT for Diffusion Language Models: Reconciling Bidirectional Attention and Autoregressive Decoding [60.06816407728172]
Discrete diffusion language models have shown strong potential for text generation. Standard supervised fine-tuning misaligns with semi-autoregressive inference. We propose Blockwise SFT, which partitions responses into fixed-size blocks.
arXiv Detail & Related papers (2025-08-27T02:49:33Z)
- ESLM: Risk-Averse Selective Language Modeling for Efficient Pretraining [53.893792844055106]
Large language model pretraining is compute-intensive, yet many tokens contribute marginally to learning, resulting in inefficiency. We introduce Selective Efficient Language Modeling, a risk-aware algorithm that improves training efficiency and distributional robustness by performing online token-level batch selection. Experiments on GPT-2 pretraining show that ESLM significantly reduces training FLOPs while maintaining or improving both perplexity and downstream performance compared to baselines.
arXiv Detail & Related papers (2025-05-26T12:23:26Z)
- Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
Evaluating constraints on every token can be prohibitively expensive. Locally constrained decoding (LCD) can distort the global distribution over strings, sampling tokens based only on local information. We show that our approach is superior to state-of-the-art baselines.
arXiv Detail & Related papers (2025-04-07T18:30:18Z)
- Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach [70.44265766483633]
We study a novel language model architecture that is capable of scaling test-time computation by implicitly reasoning in latent space. Our model works by iterating a recurrent block, thereby unrolling to arbitrary depth at test-time. We show that the resulting model can improve its performance on reasoning benchmarks, sometimes dramatically.
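The recurrent-depth idea, iterating one shared block so that test-time compute scales with the chosen recurrence depth, can be sketched in a few lines. The affine "block" below is an illustrative assumption, not the paper's architecture:

```python
# Hedged sketch of recurrent depth: the same block is iterated r times, so
# test-time compute scales with r without adding parameters. The block here
# is a toy affine map mixing the latent state with the input embedding.

def block(h, x):
    # One recurrence step: blend the latent state h with the input x.
    return [0.9 * hi + 0.1 * xi for hi, xi in zip(h, x)]

def recurrent_depth_forward(x, r):
    """Unroll the shared block r times; larger r = more test-time compute."""
    h = [0.0] * len(x)
    for _ in range(r):
        h = block(h, x)
    return h

shallow = recurrent_depth_forward([1.0], r=2)
deep = recurrent_depth_forward([1.0], r=32)
# the latent state approaches the input as r grows:
# shallow[0] == 0.19, while deep[0] is much closer to 1.0
```

Because the block is shared, the recurrence r is a free knob at inference: the same weights can be unrolled deeper on harder inputs, which is the "scaling test-time compute" claim in miniature.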
arXiv Detail & Related papers (2025-02-07T18:55:02Z)
- Bridging the Training-Inference Gap in LLMs by Leveraging Self-Generated Tokens [45.745443096804586]
Language models are often trained to maximize the likelihood of the next token given past tokens in the training dataset. During inference, they are used differently, generating text sequentially and auto-regressively by using previously generated tokens as input to predict the next one. This paper proposes two simple approaches based on the model's own generations to address this discrepancy between training and inference time.
arXiv Detail & Related papers (2024-10-18T17:48:27Z)
- Mixture-of-Depths: Dynamically allocating compute in transformer-based language models [8.774705201394916]
Transformer-based language models spread FLOPs uniformly across input sequences.
We show that transformers can learn to dynamically allocate FLOPs to specific positions in a sequence.
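Dynamic FLOP allocation of this kind is commonly implemented by routing only the top-k positions through a block. The router and block below are toy stand-ins, not the paper's learned components:

```python
# Hedged sketch of Mixture-of-Depths-style routing: a router picks the top-k
# positions to receive a block's full computation; all other positions skip
# it and pass through unchanged on the residual stream. The magnitude-based
# router and doubling "block" are illustrative assumptions.

def router_scores(xs):
    # Hypothetical router: score each position by its magnitude.
    return [abs(x) for x in xs]

def heavy_block(x):
    # Stand-in for an expensive attention/MLP block applied to routed tokens.
    return x * 2.0

def mixture_of_depths(xs, k):
    """Apply heavy_block to the k highest-scoring positions; skip the rest."""
    scores = router_scores(xs)
    topk = sorted(range(len(xs)), key=lambda i: scores[i], reverse=True)[:k]
    chosen = set(topk)
    return [heavy_block(x) if i in chosen else x for i, x in enumerate(xs)]

out = mixture_of_depths([0.1, 3.0, -2.0, 0.2], k=2)
# positions 1 and 2 are routed through the block; positions 0 and 3 skip it
```

Fixing k per layer keeps the total compute budget static and hardware-friendly while still letting the model decide *which* tokens receive it.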
arXiv Detail & Related papers (2024-04-02T19:28:11Z)
- Just One Byte (per gradient): A Note on Low-Bandwidth Decentralized Language Model Finetuning Using Shared Randomness [86.61582747039053]
Language model training in distributed settings is limited by the communication cost of exchanges.
We extend recent work using shared randomness to perform distributed fine-tuning with low bandwidth.
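The shared-randomness trick can be sketched with an SPSA-style zeroth-order step: because both workers hold the same PRNG seed, the random perturbation direction can be reconstructed locally, and only a single scalar needs to cross the wire. The objective, step sizes, and seed below are illustrative assumptions:

```python
# Hedged sketch of distributed fine-tuning with shared randomness: workers
# share a PRNG seed, so a perturbation direction is reconstructed locally
# and only one scalar (a directional-derivative estimate) is communicated.

import random

def perturbation(seed, n):
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def loss(params):
    # Toy objective: squared distance to the all-ones vector.
    return sum((p - 1.0) ** 2 for p in params)

def worker_estimate(params, seed, eps=1e-3):
    """SPSA-style scalar: finite-difference slope along the shared direction."""
    z = perturbation(seed, len(params))
    plus = [p + eps * zi for p, zi in zip(params, z)]
    minus = [p - eps * zi for p, zi in zip(params, z)]
    return (loss(plus) - loss(minus)) / (2 * eps)  # the one float to transmit

def apply_update(params, seed, g, lr=0.01):
    """Receiver rebuilds the same direction from the seed and steps along it."""
    z = perturbation(seed, len(params))
    return [p - lr * g * zi for p, zi in zip(params, z)]

params = [0.0, 0.0]
g = worker_estimate(params, seed=42)         # sent over the wire: just this scalar
params = apply_update(params, seed=42, g=g)  # direction rebuilt from seed 42
# for a small learning rate this step decreases the toy loss
```

The bandwidth story follows directly: per update, each worker transmits one scalar (quantizable to about a byte) instead of a full gradient vector, at the cost of the noisier zeroth-order gradient estimate.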
arXiv Detail & Related papers (2023-06-16T17:59:51Z)
- Lexically Constrained Neural Machine Translation with Levenshtein Transformer [8.831954614241234]
This paper proposes a simple and effective algorithm for incorporating lexical constraints in neural machine translation.
Our method injects terminology constraints at inference time without any impact on decoding speed.
arXiv Detail & Related papers (2020-04-27T09:59:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.