Next Concept Prediction in Discrete Latent Space Leads to Stronger Language Models
- URL: http://arxiv.org/abs/2602.08984v1
- Date: Mon, 09 Feb 2026 18:33:31 GMT
- Title: Next Concept Prediction in Discrete Latent Space Leads to Stronger Language Models
- Authors: Yuliang Liu, Yunchong Song, Yixuan Wang, Kewen Ge, Alex Lamb, Qipeng Guo, Kai Chen, Bowen Zhou, Zhouhan Lin
- Abstract summary: Next Concept Prediction is a generative pretraining paradigm built on top of Next Token Prediction. Our model, ConceptLM, quantizes hidden states using Vector Quantization and constructs a concept vocabulary. Results on 13 benchmarks show that NCP yields consistent performance gains over traditional token-level models.
- Score: 62.054835560934066
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose Next Concept Prediction (NCP), a generative pretraining paradigm built on top of Next Token Prediction (NTP). NCP predicts discrete concepts that span multiple tokens, thereby forming a more challenging pretraining objective. Our model, ConceptLM, quantizes hidden states using Vector Quantization and constructs a concept vocabulary. It leverages both NCP and NTP to drive parameter updates and generates a concept to guide the generation of the following tokens. We train ConceptLM from scratch at scales ranging from 70M to 1.5B parameters on up to 300B tokens of training data, using Pythia and GPT-2 backbones. Results on 13 benchmarks show that NCP yields consistent performance gains over traditional token-level models. Furthermore, continual pretraining experiments on an 8B-parameter Llama model indicate that NCP can further improve an NTP-trained model. Our analysis suggests that NCP leads to more powerful language models by introducing a harder pretraining task, providing a promising path toward better language modeling.
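The abstract describes the mechanism only at a high level. The sketch below illustrates the vector-quantization step it names: pool hidden states over multi-token spans, snap each pooled vector to its nearest codebook entry, and train with a straight-through estimator. The codebook size, span length, pooling, and loss weights are assumptions for illustration, not ConceptLM's actual design.

```python
import torch
import torch.nn.functional as F

# Hypothetical hyperparameters, not the paper's.
num_concepts, d_model, span = 8192, 768, 4
codebook = torch.nn.Embedding(num_concepts, d_model)

def quantize_to_concepts(hidden):                   # hidden: (b, t, d_model)
    b, t, d = hidden.shape
    # Pool each run of `span` tokens into one candidate concept vector.
    pooled = hidden[:, : t - t % span].reshape(b, -1, span, d).mean(2)
    # Squared-L2 distance to every codebook entry -> nearest concept id.
    dist = (pooled.unsqueeze(-2) - codebook.weight).pow(2).sum(-1)
    ids = dist.argmin(-1)                           # discrete concept ids
    quant = codebook(ids)
    # Straight-through estimator: forward uses the quantized vector,
    # gradients flow back to `pooled` unchanged.
    st = pooled + (quant - pooled).detach()
    # Standard VQ-VAE losses: codebook loss + 0.25 * commitment loss.
    vq_loss = F.mse_loss(quant, pooled.detach()) \
            + 0.25 * F.mse_loss(pooled, quant.detach())
    return ids, st, vq_loss

# A joint objective would then combine the usual token-level loss with a
# cross-entropy over next-concept ids, e.g.
#   loss = ntp_loss + lambda_ncp * ncp_loss + vq_loss   # weights assumed
```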
Related papers
- Reinforced Fast Weights with Next-Sequence Prediction [42.710296902935426]
REFINE is a reinforcement learning framework that trains fast weight models under the next-sequence prediction (NSP) objective. REFINE consistently outperforms supervised fine-tuning with NTP across needle-in-a-haystack retrieval, long-context question answering, and diverse tasks in LongBench.
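As a rough illustration of optimizing a next-sequence objective with reinforcement learning, here is a generic REINFORCE surrogate; the reward definition and REFINE's fast-weight architecture are assumptions, not details taken from the paper.

```python
import torch

def nsp_reinforce_loss(logits, sampled_ids, reward):
    """Generic REINFORCE-style surrogate for a sequence-level reward:
    sample a continuation, score it against the gold next sequence,
    and weight its log-likelihood by that reward.

    logits:      (b, t, vocab)  policy outputs over the sampled steps
    sampled_ids: (b, t)         tokens sampled from the policy
    reward:      (b,)           sequence-level reward, e.g. overlap with
                                the gold next sequence (assumed here)
    """
    logp = torch.log_softmax(logits, dim=-1)
    tok_logp = logp.gather(-1, sampled_ids.unsqueeze(-1)).squeeze(-1)
    return -(reward * tok_logp.sum(-1)).mean()
```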
arXiv Detail & Related papers (2026-02-18T18:53:18Z)
- Context-level Language Modeling by Learning Predictive Context Embeddings [79.00607069677393]
We introduce ContextLM, a framework that augments standard pretraining with an inherent next-context prediction objective. This mechanism trains the model to learn predictive representations of multi-token contexts, leveraging error signals derived from future token chunks. Experiments on the GPT-2 and Pythia model families, scaled up to 1.5B parameters, show that ContextLM delivers consistent improvements in both perplexity and downstream task performance.
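One plausible reading of a next-context objective is sketched below; the chunk size, mean-pooled target, cosine loss, and the absence of a learned projection head are all assumptions rather than details from the paper.

```python
import torch
import torch.nn.functional as F

def next_context_loss(hidden, chunk=8):
    """Hypothetical next-context objective: the hidden state at position
    t must predict a pooled embedding of the next `chunk` tokens."""
    # Targets: detached mean over each sliding window of `chunk` states.
    target = hidden.unfold(1, chunk, 1).mean(-1).detach()   # (b, t-chunk+1, d)
    # A learned projection head would normally sit on `pred`; omitted here.
    pred = hidden[:, : target.size(1) - 1]
    # State at position i predicts the chunk starting at position i+1.
    return 1 - F.cosine_similarity(pred, target[:, 1:], dim=-1).mean()
```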
arXiv Detail & Related papers (2025-10-23T07:09:45Z)
- Beyond Multi-Token Prediction: Pretraining LLMs with Future Summaries [35.39150917025755]
Future summary prediction (FSP) trains an auxiliary head to predict a compact representation of the long-term future. FSP provides improvements over both NTP and MTP across math, reasoning, and coding benchmarks.
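A minimal sketch of what such an auxiliary future-summary head could look like; the horizon, the mean-embedding target, the MSE loss, and the loss weight are illustrative guesses, not FSP's actual design.

```python
import torch
import torch.nn as nn

def future_summary_loss(hidden, token_embeds, head, horizon=64):
    """Sketch of an FSP-style auxiliary objective: a small head
    (`head`, e.g. nn.Linear(d, d), an assumed interface) predicts a
    compact stand-in for the long-term future, here the detached mean
    of the next `horizon` token embeddings."""
    t = hidden.size(1)
    preds, targets = [], []
    for i in range(t - horizon):
        preds.append(head(hidden[:, i]))                            # (b, d)
        targets.append(token_embeds[:, i + 1 : i + 1 + horizon].mean(1))
    pred = torch.stack(preds, 1)
    target = torch.stack(targets, 1).detach()   # detach: target is fixed
    return nn.functional.mse_loss(pred, target)

# At train time: total = ntp_loss + 0.1 * future_summary_loss(...)  # weight assumed
```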
arXiv Detail & Related papers (2025-10-16T14:52:52Z)
- PonderLM-2: Pretraining LLM with Latent Thoughts in Continuous Space [44.24277388571869]
We propose a novel pre-training methodology: Pretraining Language Models with Latent Thoughts (PonderLM-2). Our approach pretrains a language model (LM) to first generate an intermediate latent thought (the last hidden state of the current position), which is then used as input to predict the actual subsequent token. Experiments demonstrate that, at an identical inference cost, an LM that generates one additional latent thought per token outperforms a standard model with double the parameters.
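The generate-then-reuse mechanism can be sketched in a few lines; the interface below (a model that maps input embeddings to final hidden states) and the decoding details are assumptions.

```python
import torch

@torch.no_grad()
def latent_thought_step(model, lm_head, input_embeds):
    """Sketch of one PonderLM-2-style decoding step: run once to obtain
    a latent thought (the last hidden state), append it to the input as
    a pseudo-token embedding, then run again to predict the real token."""
    h = model(input_embeds)                    # (b, t, d) final hidden states
    thought = h[:, -1:, :]                     # latent thought at current position
    h2 = model(torch.cat([input_embeds, thought], dim=1))
    return lm_head(h2[:, -1])                  # logits for the actual next token
```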
arXiv Detail & Related papers (2025-09-27T08:38:08Z)
- Pretraining Language Models to Ponder in Continuous Space [50.52734567589996]
We introduce a pondering process into language models by repeatedly invoking the forward process within a single token generation step. We show that the model can learn to ponder in this way through self-supervised learning, without any human annotations.
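Iterating the same feedback loop as in the PonderLM-2 sketch above gives a minimal picture of pondering; the number of steps and the feedback rule are assumptions for illustration.

```python
import torch

@torch.no_grad()
def ponder(model, lm_head, input_embeds, n_steps=3):
    """Within a single token step, repeatedly re-invoke the forward
    pass, feeding the newest last hidden state back in as a pseudo-token."""
    x = input_embeds
    for _ in range(n_steps):
        h = model(x)                           # (b, t, d)
        x = torch.cat([x, h[:, -1:, :]], 1)    # append latest hidden state
    return lm_head(model(x)[:, -1])
```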
arXiv Detail & Related papers (2025-05-27T03:47:33Z)
- LLM Pretraining with Continuous Concepts [71.98047075145249]
Next token prediction has been the standard training objective used in large language model pretraining. We propose Continuous Concept Mixing (CoCoMix), a novel pretraining framework that combines discrete next token prediction with continuous concepts.
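A loose sketch of mixing a predicted continuous concept back into the hidden stream; the heads, the additive mixing rule, and the source of the target concepts are simplified assumptions, not the CoCoMix recipe.

```python
import torch.nn as nn
import torch.nn.functional as F

class ConceptMixer(nn.Module):
    """Predict a continuous concept vector from the hidden state and
    add it back into the residual stream; the target concepts are
    treated as given (CoCoMix obtains them from a separate model)."""
    def __init__(self, d_model, d_concept):
        super().__init__()
        self.predict = nn.Linear(d_model, d_concept)  # concept prediction head
        self.mix_in = nn.Linear(d_concept, d_model)   # back to model width

    def forward(self, hidden, target_concept=None):
        concept = self.predict(hidden)                # per-position continuous concept
        mixed = hidden + self.mix_in(concept)         # mix concept into the stream
        aux = None
        if target_concept is not None:
            aux = F.mse_loss(concept, target_concept) # auxiliary concept loss
        return mixed, aux
```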
arXiv Detail & Related papers (2025-02-12T16:00:11Z)
- Revisiting k-NN for Fine-tuning Pre-trained Language Models [25.105882538429743]
We revisit k-Nearest-Neighbor (kNN) classifiers for augmenting PLM-based classifiers.
At the heart of our approach is the implementation of kNN-calibrated training, which treats predicted results as indicators of easy versus hard examples.
We conduct extensive experiments on the fine-tuning and prompt-tuning paradigms in zero-shot, few-shot, and fully-supervised settings.
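A toy version of augmenting a classifier with a kNN over its own features (the calibration step that flags easy versus hard examples is not shown); k, the interpolation weight, and the distance metric are illustrative choices.

```python
import numpy as np

def knn_augmented_probs(query_feat, train_feats, train_labels,
                        model_probs, k=8, lam=0.5, n_classes=2):
    """Retrieve the k nearest training examples in the PLM's feature
    space and interpolate their label distribution with the model's
    own softmax probabilities."""
    # Euclidean distance from the query to all cached training features.
    d = np.linalg.norm(train_feats - query_feat, axis=1)
    nn_idx = np.argsort(d)[:k]
    knn_probs = np.bincount(train_labels[nn_idx], minlength=n_classes) / k
    return lam * model_probs + (1 - lam) * knn_probs
```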
arXiv Detail & Related papers (2023-04-18T15:28:47Z)
- A Kernel-Based View of Language Model Fine-Tuning [94.75146965041131]
We investigate whether the Neural Tangent Kernel (NTK) describes fine-tuning of pre-trained LMs.
We show that formulating the downstream task as a masked word prediction problem through prompting often induces kernel-based dynamics during fine-tuning.
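The quantity such a kernel view studies is the empirical NTK, which can be computed directly from parameter gradients; the scalar-output interface below is an assumption, not the paper's code.

```python
import torch

def empirical_ntk_entry(f, params, x1, x2):
    """Textbook empirical NTK between two inputs:
    K(x1, x2) = <df(x1)/dtheta, df(x2)/dtheta>.
    `f(params, x)` returning a scalar (e.g., the logit of the prompted
    masked word) is an assumed interface; `params` is a tuple of
    tensors with requires_grad=True."""
    g1 = torch.autograd.grad(f(params, x1), params)
    g2 = torch.autograd.grad(f(params, x2), params)
    return sum((a * b).sum() for a, b in zip(g1, g2))
```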
arXiv Detail & Related papers (2022-10-11T17:34:32Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
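Dynamic Blocking can be sketched as a per-step logits mask; the mechanism below (probabilistically block the source token that immediately follows a just-copied one) is a simplified reconstruction, not the paper's exact algorithm.

```python
import torch

def dynamic_blocking_mask(prev_token, source_ids, vocab_size, block_prob=0.5):
    """If the token just generated also occurs in the source sentence,
    block each source token that immediately follows an occurrence,
    discouraging verbatim copying of the source."""
    mask = torch.zeros(vocab_size, dtype=torch.bool)
    src = source_ids.tolist()
    for i, tok in enumerate(src[:-1]):
        if tok == prev_token and torch.rand(()).item() < block_prob:
            mask[src[i + 1]] = True          # forbid the next source token
    return mask                              # apply as: logits[mask] = float("-inf")
```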
arXiv Detail & Related papers (2020-10-24T11:55:28Z)