PonderLM-2: Pretraining LLM with Latent Thoughts in Continuous Space
- URL: http://arxiv.org/abs/2509.23184v2
- Date: Fri, 24 Oct 2025 06:43:56 GMT
- Title: PonderLM-2: Pretraining LLM with Latent Thoughts in Continuous Space
- Authors: Boyi Zeng, He Li, Shixiang Song, Yixuan Wang, Ziwei He, Xinbing Wang, Zhouhan Lin
- Abstract summary: We propose a novel pre-training methodology: Pretraining Language Models with Latent Thoughts (PonderLM-2). Our approach pretrains a language model (LM) to first generate an intermediate latent thought (the last hidden state of the current position), which is then used as input to predict the actual subsequent token. Experiments demonstrate that, at an identical inference cost, an LM that generates one additional latent thought per token outperforms a standard model with double the parameters.
- Score: 44.24277388571869
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The remarkable success of Chain-of-Thought (CoT), which enhances performance by scaling generation steps at test-time, inspires us to ask: can we leverage a similar scaling of computational steps during pretraining to improve the generation of each individual token? To address this, we propose a novel pre-training methodology: Pretraining Language Models with Latent Thoughts (PonderLM-2). Our approach pretrains a language model (LM) to first generate an intermediate latent thought (the last hidden state of the current position), which is then used as input to predict the actual subsequent token. This additional computational step enables the LM to refine its prediction within unconstrained continuous space. Our experiments demonstrate that, at an identical inference cost, an LM that generates one additional latent thought per token outperforms a standard model with double the parameters. For instance, our PonderLM-2-Pythia-1.4B, pretrained on 300B tokens from the Pile, significantly surpasses the vanilla Pythia-2.8B trained on the same data on both language modeling and a range of general downstream tasks. Furthermore, increasing the number of latent thoughts generated before each actual token, forming a chain analogous to CoT, consistently improves the model's performance.
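The pondering step described in the abstract can be sketched in a few lines. The snippet below is a toy illustration only, not the authors' implementation: `forward_step`, the matrix sizes, the pooling scheme, and the number of latent thoughts are all illustrative assumptions. The key idea it captures is that the last hidden state is fed back as an additional continuous input before the actual next-token prediction is made.

```python
import math
import random

random.seed(0)

DIM = 8      # toy hidden size
VOCAB = 16   # toy vocabulary size

# Toy stand-ins for learned parameters (random here, trained in practice).
W = [[random.gauss(0, 0.3) for _ in range(DIM)] for _ in range(DIM)]
W_OUT = [[random.gauss(0, 0.3) for _ in range(DIM)] for _ in range(VOCAB)]
EMB = [[random.gauss(0, 1.0) for _ in range(DIM)] for _ in range(VOCAB)]

def forward_step(inputs):
    """Toy 'transformer' pass: mean-pool the input vectors, then apply a
    tanh layer. Returns the hidden state of the last position."""
    pooled = [sum(v[i] for v in inputs) / len(inputs) for i in range(DIM)]
    return [math.tanh(sum(W[i][j] * pooled[j] for j in range(DIM)))
            for i in range(DIM)]

def logits(hidden):
    """Project a hidden state onto the (toy) vocabulary."""
    return [sum(W_OUT[v][j] * hidden[j] for j in range(DIM))
            for v in range(VOCAB)]

def predict_with_latent_thoughts(token_ids, num_thoughts=1):
    """PonderLM-2-style sketch: before predicting the next token, feed the
    current last hidden state back as an extra continuous input
    (`num_thoughts` times), refining the prediction in continuous space."""
    inputs = [EMB[t] for t in token_ids]
    hidden = forward_step(inputs)
    for _ in range(num_thoughts):
        inputs = inputs + [hidden]   # the latent thought becomes an input
        hidden = forward_step(inputs)
    scores = logits(hidden)
    return max(range(VOCAB), key=lambda v: scores[v])

next_tok = predict_with_latent_thoughts([3, 7, 5], num_thoughts=2)
print(next_tok)
```

With `num_thoughts=0` this reduces to ordinary next-token prediction; increasing it mimics the CoT-like chain of latent thoughts the abstract describes, at the cost of extra forward passes per token.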
Related papers
- Next Concept Prediction in Discrete Latent Space Leads to Stronger Language Models [62.054835560934066]
Next Concept Prediction is a generative pretraining paradigm built on top of Next Token Prediction. Our model, ConceptLM, quantizes hidden states using Vector Quantization and constructs a concept vocabulary. Results on 13 benchmarks show that NCP yields consistent performance gains over traditional token-level models.
arXiv Detail & Related papers (2026-02-09T18:33:31Z) - Continuous Autoregressive Language Models [56.49239051750678]
We introduce Continuous Autoregressive Language Models (CALM). CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector. We develop a comprehensive likelihood-free framework that enables robust training, evaluation, and controllable sampling.
arXiv Detail & Related papers (2025-10-31T17:58:11Z) - Post-Completion Learning for Language Models [20.589364712188015]
Current language model training paradigms terminate learning upon reaching the end-of-sequence (<eos>) token. We propose Post-Completion Learning (PCL), a novel training framework that systematically utilizes the sequence space after model output completion. PCL enables models to continue generating self-assessments and reward predictions during training, while maintaining efficient inference by stopping at the completion point.
arXiv Detail & Related papers (2025-07-27T12:47:26Z) - Pretraining Language Models to Ponder in Continuous Space [50.52734567589996]
We introduce this pondering process into language models by repeatedly invoking the forward process within a single token generation step. We show that the model can learn to ponder in this way through self-supervised learning, without any human annotations.
arXiv Detail & Related papers (2025-05-27T03:47:33Z) - Text Generation Beyond Discrete Token Sampling [75.96920867382859]
Mixture of Inputs (MoI) is a training-free method for autoregressive generation. MoI consistently improves performance across multiple models including QwQ-32B, Nemotron-Super-49B, Gemma-3-27B, and DAPO-Qwen-32B.
arXiv Detail & Related papers (2025-05-20T18:41:46Z) - Bridging the Training-Inference Gap in LLMs by Leveraging Self-Generated Tokens [45.745443096804586]
Language models are often trained to maximize the likelihood of the next token given past tokens in the training dataset. During inference, however, they are used differently, generating text sequentially and auto-regressively by using previously generated tokens as input to predict the next one. This paper proposes two simple approaches based on the model's own generations to address this discrepancy between training and inference time.
arXiv Detail & Related papers (2024-10-18T17:48:27Z) - Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.