Modeling Language as a Sequence of Thoughts
- URL: http://arxiv.org/abs/2512.25026v1
- Date: Wed, 31 Dec 2025 18:24:57 GMT
- Title: Modeling Language as a Sequence of Thoughts
- Authors: Nasim Borazjanizadeh, James McClelland,
- Abstract summary: Transformer language models can generate strikingly natural text by modeling language as a sequence of tokens. Yet, they fail to form globally consistent latent representations of entities and events, the lack of which contributes to brittleness in relational direction (e.g., reversal curse), contextualization errors, and data inefficiency. We introduce the Thought Gestalt (TG) model, a recurrent Transformer that models language at two levels of abstraction - tokens and sentence-level "thought" states.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer language models can generate strikingly natural text by modeling language as a sequence of tokens. Yet, by relying primarily on surface-level co-occurrence statistics, they fail to form globally consistent latent representations of entities and events, the lack of which contributes to brittleness in relational direction (e.g., reversal curse), contextualization errors, and data inefficiency. On the other hand, cognitive science shows that human comprehension involves converting the input linguistic stream into compact, event-like representations that persist in memory while verbatim form is short-lived. Motivated by this view, we introduce the Thought Gestalt (TG) model, a recurrent Transformer that models language at two levels of abstraction - tokens and sentence-level "thought" states. TG generates the tokens of one sentence at a time while cross-attending to a memory of prior sentence representations. In TG, token and sentence representations are generated using the same set of model parameters and trained with a single objective, the next-token cross-entropy: by retaining the computation graph of sentence representations written to memory, gradients from future token losses flow backward through cross-attention to optimize the parameters generating earlier sentence vectors. In scaling experiments, TG consistently improves efficiency over matched GPT-2 runs, among other baselines, with scaling fits indicating GPT-2 requires ~5-8% more data and ~33-42% more parameters to match TG's loss. TG also reduces errors on relational direction generalization on a father-son reversal curse probe.
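
As a rough illustration of the two-level scheme the abstract describes, the sketch below generates one sentence at a time, cross-attends to a memory of previously pooled sentence vectors, and appends each new sentence vector to that memory without detaching it, so later token losses can backpropagate into the parameters that produced earlier "thought" vectors. All module names, the pooling choice, and the hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a two-level (token / sentence) language model in the spirit
# of the abstract above. Hypothetical names; not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ThoughtGestaltSketch(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.lm_head = nn.Linear(d_model, vocab_size)

    def encode_sentence(self, tokens: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # tokens: (1, T) ids of one sentence; memory: (1, S, d) prior sentence vectors.
        x = self.embed(tokens)
        causal = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool), 1)
        h, _ = self.self_attn(x, x, x, attn_mask=causal)
        if memory.size(1) > 0:                       # cross-attend to prior "thoughts"
            m, _ = self.cross_attn(h, memory, memory)
            h = h + m
        return h + self.ff(h)

    def forward(self, sentences: list[torch.Tensor]) -> torch.Tensor:
        # sentences: list of (1, T_i) token-id tensors, one per sentence.
        memory = torch.zeros(1, 0, self.embed.embedding_dim)
        loss, n_tok = 0.0, 0
        for sent in sentences:
            h = self.encode_sentence(sent, memory)
            logits = self.lm_head(h[:, :-1])         # next-token prediction within the sentence
            loss = loss + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), sent[:, 1:].reshape(-1), reduction="sum")
            n_tok += sent.size(1) - 1
            # Mean-pool the sentence into one "thought" vector; do NOT detach,
            # so gradients from later sentences flow back through the memory.
            memory = torch.cat([memory, h.mean(dim=1, keepdim=True)], dim=1)
        return loss / max(n_tok, 1)

# Toy usage: three "sentences" of random token ids, trained with a single
# next-token cross-entropy objective.
model = ThoughtGestaltSketch(vocab_size=50)
sents = [torch.randint(50, (1, n)) for n in (6, 9, 7)]
model(sents).backward()
```

The key detail mirrored from the abstract is that the sentence memory is never detached between sentences; a truncated-backpropagation variant would detach it after a fixed number of sentences, trading gradient flow for memory.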
Related papers
- Idea-Gated Transformers: Enforcing Semantic Coherence via Differentiable Vocabulary Pruning [0.40611352512781856]
We introduce the Idea-Gated Transformer, a novel architecture that separates semantic planning from syntactic generation.<n>We propose a differentiable gating mechanism that suppresses semantically irrelevant tokens, effectively pruning the search space in real-time.
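
One generic way to realize differentiable vocabulary pruning of the kind summarized above is to turn a planning vector into a soft gate over the vocabulary and add its log to the decoder logits, so irrelevant tokens are suppressed while gradients still flow. The sketch below is a minimal, hypothetical version of such a gate, not the paper's actual architecture.

```python
# Hypothetical differentiable vocabulary gate; illustrative only.
import torch
import torch.nn as nn

class VocabGate(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.to_gate = nn.Linear(d_model, vocab_size)

    def forward(self, logits: torch.Tensor, plan: torch.Tensor) -> torch.Tensor:
        # logits: (batch, seq, vocab) from the decoder; plan: (batch, d_model).
        gate = torch.sigmoid(self.to_gate(plan)).unsqueeze(1)  # (batch, 1, vocab), values in (0, 1)
        # Adding log(gate) to the logits down-weights gated-out tokens after softmax,
        # while keeping the whole operation differentiable.
        return logits + torch.log(gate + 1e-9)

# Toy usage.
gate = VocabGate(d_model=64, vocab_size=1000)
out = gate(torch.randn(2, 5, 1000), torch.randn(2, 64))
print(out.shape)  # torch.Size([2, 5, 1000])
```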
arXiv Detail & Related papers (2025-12-03T01:17:07Z)
- REAR: Rethinking Visual Autoregressive Models via Generator-Tokenizer Consistency Regularization [130.46612643194973]
reAR is a simple training strategy introducing a token-wise regularization objective. On ImageNet, it reduces gFID from 3.02 to 1.86 and improves IS to 316.9 using a standardization-based tokenizer. When applied to advanced tokenizers, it achieves a gFID of 1.42 with only 177M parameters, matching the performance of larger state-of-the-art diffusion models (675M).
arXiv Detail & Related papers (2025-10-06T02:48:13Z)
- Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models [92.18057318458528]
Token-Shuffle is a novel method that reduces the number of image tokens in the Transformer. Our strategy requires no additional pretrained text encoder and enables MLLMs to support extremely high-resolution image synthesis. On the GenAI benchmark, our 2.7B model achieves a 0.77 overall score on hard prompts, outperforming the AR model LlamaGen by 0.18 and the diffusion model LDM by 0.15.
arXiv Detail & Related papers (2025-04-24T17:59:56Z)
- Test-Time Alignment for Large Language Models via Textual Model Predictive Control [63.508812485566374]
Textual Model Predictive Control (TMPC) is a novel predictive planning framework adapted for aligning Large Language Models at inference time. TMPC is evaluated on three tasks with distinct segmentation properties: discourse-level translation, long-form response generation, and program synthesis. Results demonstrate that TMPC consistently improves performance, highlighting its generality.
arXiv Detail & Related papers (2025-02-28T07:24:33Z)
- Future Token Prediction -- Causal Language Modelling with Per-Token Semantic State Vector for Multi-Token Prediction [0.0]
This research investigates a new pretraining method called Future Token Prediction (FTP).
FTP generates embedding vectors for each token position that are linearly and expansively projected to a pseudo-sequence.
On a toy, but complex, coding problem, FTP networks produce significantly better results than GPT networks.
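
The summary above says FTP projects each position's embedding "linearly and expansively" to a pseudo-sequence used for multi-token prediction. Below is a minimal sketch of one way such a head could look; the layer names, the number of future tokens k, and the shared decoding head are assumptions for illustration, not the paper's exact design.

```python
# Hypothetical multi-token prediction head with an expansive linear projection.
import torch
import torch.nn as nn

class FutureTokenHead(nn.Module):
    def __init__(self, d_model: int, vocab_size: int, k_future: int = 4):
        super().__init__()
        self.k = k_future
        self.expand = nn.Linear(d_model, k_future * d_model)  # expansive projection
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model) from a causal backbone.
        b, t, d = hidden.shape
        pseudo = self.expand(hidden).view(b, t, self.k, d)     # per-position pseudo-sequence
        return self.lm_head(pseudo)                            # (batch, seq, k, vocab)

# Toy usage: logits[:, i, j] would be trained to predict token i + 1 + j.
head = FutureTokenHead(d_model=128, vocab_size=1000)
print(head(torch.randn(2, 10, 128)).shape)  # torch.Size([2, 10, 4, 1000])
```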
arXiv Detail & Related papers (2024-10-23T14:50:15Z)
- Why Does Surprisal From Larger Transformer-Based Language Models Provide a Poorer Fit to Human Reading Times? [9.909170013118775]
These results suggest that the propensity of larger Transformer-based models to 'memorize' sequences during training makes their surprisal estimates diverge from humanlike expectations.
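
For reference, surprisal here is the negative log-probability a language model assigns to each token given its preceding context. The snippet below computes per-token surprisal (in bits) from any causal LM's logits; the random logits merely stand in for a real model's output.

```python
import math
import torch
import torch.nn.functional as F

def per_token_surprisal(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Surprisal (in bits) of each token given its preceding context.

    logits: (batch, seq, vocab) from a causal LM; logits[:, i] predicts input_ids[:, i + 1].
    """
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)          # natural-log probabilities
    targets = input_ids[:, 1:].unsqueeze(-1)                   # (batch, seq - 1, 1)
    nats = -log_probs.gather(-1, targets).squeeze(-1)          # negative log-likelihood per token
    return nats / math.log(2)                                  # convert nats to bits

# Toy usage with random logits standing in for a real model's output.
logits = torch.randn(2, 8, 100)
ids = torch.randint(100, (2, 8))
print(per_token_surprisal(logits, ids).shape)  # torch.Size([2, 7])
```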
arXiv Detail & Related papers (2022-12-23T03:57:54Z)
- HETFORMER: Heterogeneous Transformer with Sparse Attention for Long-Text Extractive Summarization [57.798070356553936]
HETFORMER is a Transformer-based pre-trained model with multi-granularity sparse attentions for extractive summarization.
Experiments on both single- and multi-document summarization tasks show that HETFORMER achieves state-of-the-art performance in ROUGE F1.
arXiv Detail & Related papers (2021-10-12T22:42:31Z)
- Scaling Laws for Neural Machine Translation [21.76567580425173]
We show that cross-entropy loss as a function of model size follows a certain scaling law.
We also investigate the relationship between the cross-entropy loss and the quality of the translations generated.
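
As a concrete illustration of fitting such a scaling law (loss as a function of model size), the snippet below fits a saturating power law L(N) = L_inf + c·N^(-alpha) to hypothetical (model size, loss) pairs. The functional form and all numbers are placeholders, not the paper's reported fit.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (parameter count, validation loss) pairs; real values would
# come from a sweep of trained models.
n_params = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
loss = np.array([4.14, 3.93, 3.75, 3.60, 3.47])

def power_law(n, l_inf, c, alpha):
    # Irreducible loss plus a term that decays as a power of model size.
    return l_inf + c * n ** (-alpha)

(l_inf, c, alpha), _ = curve_fit(power_law, n_params, loss, p0=[3.0, 10.0, 0.1], maxfev=20000)
print(f"irreducible loss ~ {l_inf:.2f}, exponent alpha ~ {alpha:.3f}")
```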
arXiv Detail & Related papers (2021-09-16T06:15:20Z)
- Tail-to-Tail Non-Autoregressive Sequence Prediction for Chinese Grammatical Error Correction [49.25830718574892]
We present a new framework named Tail-to-Tail (TtT) non-autoregressive sequence prediction. It builds on the observation that most tokens are correct and can be conveyed directly from source to target, while the error positions can be estimated and corrected.
Experimental results on standard datasets, especially on the variable-length datasets, demonstrate the effectiveness of TtT in terms of sentence-level Accuracy, Precision, Recall, and F1-Measure.
arXiv Detail & Related papers (2021-06-03T05:56:57Z)
- GTAE: Graph-Transformer based Auto-Encoders for Linguistic-Constrained Text Style Transfer [119.70961704127157]
Non-parallel text style transfer has attracted increasing research interests in recent years.
Current approaches still lack the ability to preserve the content and even logic of original sentences.
We propose the Graph-Transformer based Auto-Encoder (GTAE), which models a sentence as a linguistic graph and performs feature extraction and style transfer at the graph level.
arXiv Detail & Related papers (2021-02-01T11:08:45Z)
- Stacked DeBERT: All Attention in Incomplete Data for Text Classification [8.900866276512364]
We propose Stacked DeBERT, short for Stacked Denoising Bidirectional Representations from Transformers.
Our model shows improved F1-scores and better robustness on informal/incorrect texts found in tweets and on texts with Speech-to-Text errors in sentiment and intent classification tasks.
arXiv Detail & Related papers (2020-01-01T04:49:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.