Looking beyond the next token
- URL: http://arxiv.org/abs/2504.11336v2
- Date: Thu, 24 Apr 2025 03:13:28 GMT
- Title: Looking beyond the next token
- Authors: Abitha Thankaraj, Yiding Jiang, J. Zico Kolter, Yonatan Bisk
- Abstract summary: We argue that rearranging and processing the training data sequences can allow models to more accurately imitate the true data-generating process. Our method naturally enables the generation of long-term goals at no additional cost.
- Score: 75.00751370502168
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The structure of causal language model training assumes that each token can be accurately predicted from the previous context. This contrasts with humans' natural writing and reasoning process, where goals are typically known before the exact argument or phrasings. While this mismatch has been well studied in the literature, the working assumption has been that architectural changes are needed to address this mismatch. We argue that rearranging and processing the training data sequences can allow models to more accurately imitate the true data-generating process, and does not require any other changes to the architecture or training infrastructure. We demonstrate that this technique, Trelawney, and the inference algorithms derived from it allow us to improve performance on several key benchmarks that span planning, algorithmic reasoning, and story generation tasks. Finally, our method naturally enables the generation of long-term goals at no additional cost. We investigate how using the model's goal-generation capability can further improve planning and reasoning. Additionally, we believe Trelawney could potentially open doors to new capabilities beyond the current language modeling paradigm.
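The abstract does not say how the training sequences are rearranged. As an illustration only, the sketch below assumes one possible scheme: a span from later in the sequence is copied forward and wrapped in delimiter tokens, so that an ordinary next-token model sees a "goal" before the text that leads to it. The delimiter names, the function names, and the sampling rule are all hypothetical and are not taken from the paper.

```python
import random
from typing import List, Optional

# Hypothetical delimiter tokens; the abstract does not name any.
FUTURE_OPEN = "<future>"
FUTURE_CLOSE = "</future>"


def insert_future_span(
    tokens: List[str], insert_at: int, span_start: int, span_len: int
) -> List[str]:
    """Copy tokens[span_start:span_start + span_len] and insert the copy,
    wrapped in delimiters, at position insert_at.

    The original tokens keep their order; only the delimited copy is added.
    """
    if span_len <= 0 or span_start + span_len > len(tokens):
        raise ValueError("span must be non-empty and lie inside the sequence")
    if not 0 <= insert_at <= span_start:
        raise ValueError("insert_at must not come after the copied span")
    span = tokens[span_start:span_start + span_len]
    return tokens[:insert_at] + [FUTURE_OPEN] + span + [FUTURE_CLOSE] + tokens[insert_at:]


def augment(
    tokens: List[str], span_len: int = 2, rng: Optional[random.Random] = None
) -> List[str]:
    """Pick a random future span and a random earlier insertion point."""
    rng = rng or random.Random()
    if len(tokens) < span_len + 1:
        return list(tokens)  # too short to have a strictly later span
    span_start = rng.randint(1, len(tokens) - span_len)  # bounds are inclusive
    insert_at = rng.randint(0, span_start - 1)
    return insert_future_span(tokens, insert_at, span_start, span_len)


if __name__ == "__main__":
    story = "the hero leaves home and finds the treasure".split()
    print(" ".join(insert_future_span(story, 2, 6, 2)))
    # the hero <future> the treasure </future> leaves home and finds the treasure
```

Because only the data is changed, a sketch like this would leave the model architecture and the training loop untouched, which is the property the abstract emphasizes. At inference time, a prompt ending in the opening delimiter could in principle ask the model to produce a goal first, though how the paper's inference algorithms actually work is not described here.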
Related papers
- Beyond Scaleup: Knowledge-aware Parsimony Learning from Deep Networks [47.6830995661091]
Brute-force scaleup of training datasets, learnable parameters and computation power has become a prevalent strategy for developing more robust learning models. In this paper, we attempt to address this issue in a parsimonious manner, achieving greater potential with simpler models. The key is to drive models using domain-specific knowledge, such as symbols, logic, and formulas, instead of purely relying on scaleup.
arXiv Detail & Related papers (2024-06-29T15:52:37Z)
- Training Neural Networks with Internal State, Unconstrained Connectivity, and Discrete Activations [66.53734987585244]
True intelligence may require the ability of a machine learning model to manage internal state.
We show that we have not yet discovered the most effective algorithms for training such models.
We present one attempt to design such a training algorithm, applied to an architecture with binary activations and only a single matrix of weights.
arXiv Detail & Related papers (2023-12-22T01:19:08Z)
- Opening the Black Box: Analyzing Attention Weights and Hidden States in Pre-trained Language Models for Non-language Tasks [0.8889304968879164]
We apply a pre-trained language model to constrained arithmetic problems with hierarchical structure, to analyze their attention weight scores and hidden states.
The investigation reveals promising results, with the model addressing hierarchical problems in a moderately structured manner, similar to human problem-solving strategies.
The attention analysis allows us to hypothesize that the model can generalize to longer sequences in ListOps dataset, a conclusion later confirmed through testing on sequences longer than those in the training set.
arXiv Detail & Related papers (2023-06-21T11:48:07Z)
- PDSketch: Integrated Planning Domain Programming and Learning [86.07442931141637]
We present a new domain definition language, named PDSketch.
It allows users to flexibly define high-level structures in the transition models.
Details of the transition model will be filled in by trainable neural networks.
arXiv Detail & Related papers (2023-03-09T18:54:12Z)
- Robust Graph Representation Learning via Predictive Coding [46.22695915912123]
Predictive coding is a message-passing framework initially developed to model information processing in the brain.
In this work, we build models that rely on the message-passing rule of predictive coding.
We show that the proposed models are comparable to standard ones in terms of performance in both inductive and transductive tasks.
arXiv Detail & Related papers (2022-12-09T03:58:22Z)
- Autoregressive Structured Prediction with Language Models [73.11519625765301]
We describe an approach to model structures as sequences of actions in an autoregressive manner with PLMs.
Our approach achieves the new state-of-the-art on all the structured prediction tasks we looked at.
arXiv Detail & Related papers (2022-10-26T13:27:26Z)
- Arithmetic-Based Pretraining -- Improving Numeracy of Pretrained Language Models [67.48894919842576]
State-of-the-art pretrained language models tend to perform below their capabilities when applied out-of-the-box on tasks that require numeracy.
We propose a new extended pretraining approach called Arithmetic-Based Pretraining that jointly addresses both in one extended pretraining step.
Our experiments show the effectiveness of Arithmetic-Based Pretraining in three different tasks that require improved numeracy.
arXiv Detail & Related papers (2022-05-13T16:10:13Z)
- Efficient Sub-structured Knowledge Distillation [52.5931565465661]
We propose an approach that is much simpler in its formulation and far more efficient for training than existing approaches.
We transfer the knowledge from a teacher model to its student model by locally matching their predictions on all sub-structures, instead of the whole output space.
arXiv Detail & Related papers (2022-03-09T15:56:49Z)
- Updater-Extractor Architecture for Inductive World State Representations [0.0]
We propose a transformer-based Updater-Extractor architecture and a training procedure that can work with sequences of arbitrary length.
We explicitly train the model to incorporate incoming information into its world state representation.
Empirically, we investigate the model performance on three different tasks, demonstrating its promise.
arXiv Detail & Related papers (2021-04-12T14:30:11Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
- On the comparability of Pre-trained Language Models [0.0]
Recent developments in unsupervised representation learning have successfully established the concept of transfer learning in NLP.
More elaborated architectures are making better use of contextual information.
Larger corpora are used as resources for pre-training large language models in a self-supervised fashion.
Advances in parallel computing as well as in cloud computing made it possible to train these models with growing capacities in the same or even in shorter time than previously established models.
arXiv Detail & Related papers (2020-01-03T10:53:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.