A Causal World Model Underlying Next Token Prediction: Exploring GPT in a Controlled Environment
- URL: http://arxiv.org/abs/2412.07446v4
- Date: Sun, 06 Jul 2025 09:43:56 GMT
- Title: A Causal World Model Underlying Next Token Prediction: Exploring GPT in a Controlled Environment
- Authors: Raanan Y. Rohekar, Yaniv Gurwicz, Sungduk Yu, Estelle Aflalo, Vasudev Lal
- Abstract summary: Are generative pre-trained transformer (GPT) models, trained only to predict the next token, implicitly learning a world model from which sequences are generated one token at a time? We find that the GPT model is likely to generate legal next moves for out-of-distribution sequences for which a causal structure is encoded in the attention mechanism with high confidence. In cases where it generates illegal moves, it also fails to capture a causal structure.
- Score: 5.156443267442059
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Are generative pre-trained transformer (GPT) models, trained only to predict the next token, implicitly learning a world model from which sequences are generated one token at a time? We address this question by deriving a causal interpretation of the attention mechanism in GPT and presenting a causal world model that arises from this interpretation. Furthermore, we propose that GPT models, at inference time, can be utilized for zero-shot causal structure learning for input sequences, and introduce a corresponding confidence score. Empirical tests were conducted in controlled environments using the setups of the Othello and Chess strategy games. A GPT, pre-trained on real-world games played with the intention of winning, was tested on out-of-distribution synthetic data consisting of sequences of random legal moves. We find that the GPT model is likely to generate legal next moves for out-of-distribution sequences for which a causal structure is encoded in the attention mechanism with high confidence. In cases where it generates illegal moves, it also fails to capture a causal structure.
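As a rough illustration of how attention weights over a move sequence might be read as a candidate causal structure with an attached confidence score, here is a minimal sketch. The thresholding and scoring below are illustrative assumptions, not the paper's derived construction, and `candidate_causal_graph` is a hypothetical helper.

```python
# Hedged sketch: reading a GPT's attention over one input sequence as a
# candidate causal graph with a simple confidence score. The threshold and
# scoring rule are illustrative, not the paper's exact method.
import numpy as np

def candidate_causal_graph(attention, threshold=0.1):
    """attention: (num_layers, num_heads, seq_len, seq_len) weights for one
    sequence; row i holds position i's attention over earlier positions."""
    avg = attention.mean(axis=(0, 1))              # (seq_len, seq_len)
    seq_len = avg.shape[-1]
    # Candidate edge j -> i: earlier token j is kept as a parent of token i
    # whenever the averaged attention from i to j clears the threshold.
    edges = np.tril(avg, k=-1) >= threshold
    # Toy confidence score: fraction of each position's past attention mass
    # that falls on its selected parents, averaged over positions with parents.
    scores = []
    for i in range(1, seq_len):
        past = avg[i, :i]
        if edges[i, :i].any():
            scores.append(past[edges[i, :i]].sum() / (past.sum() + 1e-9))
    confidence = float(np.mean(scores)) if scores else 0.0
    return edges, confidence

# Usage with random stand-in attention (8 layers, 8 heads, 12 tokens):
rng = np.random.default_rng(0)
attn = rng.dirichlet(np.ones(12), size=(8, 8, 12))   # each row sums to 1
graph, conf = candidate_causal_graph(attn)
print(graph.shape, round(conf, 3))
```

In the paper's controlled setting, sequences whose attention yields a high-confidence structure tend to receive legal next moves, which is what such a score would be checked against.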
Related papers
- ROBAD: Robust Adversary-aware Local-Global Attended Bad Actor Detection Sequential Model [21.007806717655146]
ROBAD uses the sequence of a user's posts to generate a user embedding for detecting bad actors. Experiments on Yelp and Wikipedia show that ROBAD can effectively detect bad actors even under state-of-the-art adversarial attacks.
arXiv Detail & Related papers (2025-07-20T18:03:44Z) - Adversarial Manipulation of Reasoning Models using Internal Representations [1.024113475677323]
We identify a linear direction in activation space during CoT token generation that predicts whether the model will refuse or comply. We show that intervening only on CoT token activations suffices to control final outputs, and that incorporating this direction into prompt-based attacks improves success rates. Our findings suggest that the chain-of-thought itself is a promising new target for adversarial manipulation in reasoning models.
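A minimal sketch of the general difference-of-means direction technique this summary alludes to, assuming activations have already been collected at a fixed layer; the helper names and toy data are stand-ins, and the paper's exact procedure may differ.

```python
# Hedged sketch of the "difference-of-means direction" idea: estimate a
# refusal-vs-compliance direction from hidden activations, use it as a
# linear predictor, and intervene by adding a scaled copy of it.
import numpy as np

def refusal_direction(refuse_acts, comply_acts):
    """Each argument: (num_examples, hidden_dim) activations collected at a
    fixed layer during chain-of-thought generation."""
    direction = refuse_acts.mean(axis=0) - comply_acts.mean(axis=0)
    return direction / (np.linalg.norm(direction) + 1e-9)

def predict_refusal(activation, direction, threshold=0.0):
    # Project one activation onto the direction; a positive projection is
    # read as "the model will refuse" in this toy setup.
    return float(activation @ direction) > threshold

def steer(activation, direction, alpha=-4.0):
    # Intervention: shift the activation along (or against) the direction.
    return activation + alpha * direction

# Toy usage with random stand-in activations:
rng = np.random.default_rng(0)
refuse = rng.normal(1.0, 1.0, size=(64, 512))
comply = rng.normal(-1.0, 1.0, size=(64, 512))
d = refusal_direction(refuse, comply)
print(predict_refusal(refuse[0], d))                 # True for this toy data
print(refuse[0] @ d, steer(refuse[0], d) @ d)        # projection drops after steering
```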
arXiv Detail & Related papers (2025-07-03T20:51:32Z) - Selective Temporal Knowledge Graph Reasoning [70.11788354442218]
Temporal Knowledge Graph (TKG) reasoning aims to predict future facts based on given historical ones.
Existing TKG reasoning models are unable to abstain from predictions about which they are uncertain.
We propose an abstention mechanism for TKG reasoning, which helps the existing models make selective, instead of indiscriminate, predictions.
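A minimal sketch of confidence-thresholded abstention (selective prediction) in general; the paper's TKG-specific abstention mechanism is more involved, and the threshold and helper names here are assumptions.

```python
# Hedged sketch of a generic confidence-based abstention wrapper: a model
# only commits to a prediction when its top probability clears a threshold.
import numpy as np

ABSTAIN = -1

def selective_predict(probs, threshold=0.7):
    """probs: (num_queries, num_candidates) predictive distribution per query.
    Returns the argmax where confident, ABSTAIN elsewhere."""
    top = probs.max(axis=1)
    preds = probs.argmax(axis=1)
    return np.where(top >= threshold, preds, ABSTAIN)

# Toy usage: three queries, four candidate entities.
probs = np.array([
    [0.05, 0.85, 0.05, 0.05],   # confident  -> predict 1
    [0.30, 0.25, 0.25, 0.20],   # uncertain  -> abstain
    [0.10, 0.10, 0.75, 0.05],   # confident  -> predict 2
])
print(selective_predict(probs))   # [ 1 -1  2]
```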
arXiv Detail & Related papers (2024-04-02T06:56:21Z) - Emergent World Models and Latent Variable Estimation in Chess-Playing Language Models [0.0]
We train a GPT model on Othello games and find that the model learned an internal representation of the board state.
We extend this work into the more complex domain of chess, training on real games and investigating our model's internal representations.
Unlike Li et al.'s prior synthetic dataset approach, our analysis finds that the model also learns to estimate latent variables like player skill to better predict the next character.
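A minimal sketch of the linear-probing methodology this line of work relies on, with random stand-in hidden states; in practice the states come from the trained chess or Othello GPT and the labels from the reconstructed board position or an estimated latent variable such as player skill.

```python
# Hedged sketch of a linear probe: fit a simple classifier from a model's
# hidden states to a board-state property. Hidden states here are random
# stand-ins rather than activations from a trained GPT.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
hidden_dim, n_positions = 256, 2000

# Stand-in "hidden states" and a binary label per position, e.g. whether a
# particular square is currently occupied by the current player's piece.
true_direction = rng.normal(size=hidden_dim)
hidden_states = rng.normal(size=(n_positions, hidden_dim))
labels = (hidden_states @ true_direction > 0).astype(int)

probe = LogisticRegression(max_iter=1000).fit(hidden_states[:1500], labels[:1500])
accuracy = probe.score(hidden_states[1500:], labels[1500:])
print(f"probe accuracy on held-out positions: {accuracy:.2f}")
# High held-out accuracy suggests the property is linearly decodable from the
# states; near-chance accuracy suggests it is not (at least not linearly).
```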
arXiv Detail & Related papers (2024-03-21T18:53:23Z) - When Does Confidence-Based Cascade Deferral Suffice? [69.28314307469381]
Cascades are a classical strategy to enable inference cost to vary adaptively across samples.
A deferral rule determines whether to invoke the next classifier in the sequence, or to terminate prediction.
Despite being oblivious to the structure of the cascade, confidence-based deferral often works remarkably well in practice.
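A minimal sketch of confidence-based deferral in a two-model cascade, assuming both models expose softmax probabilities; the model objects and threshold are placeholders.

```python
# Hedged sketch of confidence-based cascade deferral: run a cheap model first
# and only invoke the expensive model when the cheap model's top softmax
# probability falls below a threshold.
import numpy as np

def cascade_predict(x, small_model, large_model, threshold=0.8):
    probs = small_model(x)                 # (num_classes,) softmax output
    if probs.max() >= threshold:
        return int(probs.argmax()), "small"
    return int(large_model(x).argmax()), "large"

# Toy usage with stand-in models returning fixed distributions:
small = lambda x: np.array([0.55, 0.30, 0.15])   # low confidence -> defer
large = lambda x: np.array([0.10, 0.85, 0.05])
print(cascade_predict(None, small, large))       # (1, 'large')
```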
arXiv Detail & Related papers (2023-07-06T04:13:57Z) - Learning and Leveraging Verifiers to Improve Planning Capabilities of
Pre-trained Language Models [20.13307800821161]
We empirically demonstrate that the performance of a finetuned baseline remains poor because it violates pre-conditions of actions in the plans that it generates.
To improve the planning capabilities of a finetuned LLM, we train a verifier, which can classify actions as being valid or invalid in a particular state.
With diverse sampling from the generator and checking by the verifier, we show significant gains in the success rate on the Blocksworld domain.
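A minimal sketch of this sample-then-verify pattern, under the assumption that the verifier checks one action at a time against the current state; `generate_plan`, `toy_verifier`, and the Blocksworld-flavored toy state are hypothetical stand-ins.

```python
# Hedged sketch: draw diverse plans from a generator (e.g. a finetuned LLM)
# and keep the first one whose actions the verifier accepts in sequence.
def plan_with_verifier(generate_plan, verifier, initial_state, num_samples=8):
    for _ in range(num_samples):
        plan = generate_plan()                   # list of actions
        state, valid = initial_state, True
        for action in plan:
            ok, state = verifier(state, action)  # is the action valid here?
            if not ok:
                valid = False
                break
        if valid:
            return plan
    return None  # no verified plan found within the sampling budget

# Toy Blocksworld-flavored usage: state is a set of clear blocks, and an
# action "pickup X" is valid only if X is currently clear.
def toy_verifier(state, action):
    block = action.split()[1]
    return (block in state, state - {block})

plans = iter([["pickup A", "pickup A"], ["pickup A", "pickup B"]])
print(plan_with_verifier(lambda: next(plans), toy_verifier, {"A", "B"}))
```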
arXiv Detail & Related papers (2023-05-26T16:36:55Z) - Physics of Language Models: Part 1, Learning Hierarchical Language Structures [51.68385617116854]
Transformer-based language models are effective but complex, and understanding their inner workings is a significant challenge.
We introduce a family of synthetic CFGs with hierarchical rules that can generate lengthy sentences.
We demonstrate that generative models like GPT can accurately learn this CFG language and generate sentences based on it.
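A minimal sketch of sampling strings from a small synthetic CFG with nested rules, in the spirit of the setup described; the toy grammar and depth cap below are assumptions, and the paper's grammars are substantially larger and deeper.

```python
# Hedged sketch of generating strings from a tiny synthetic CFG with
# hierarchical (nested) rules.
import random

GRAMMAR = {
    "S": [["A", "B"], ["B", "A", "S"]],
    "A": [["a"], ["a", "A"]],
    "B": [["b"], ["(", "S", ")"]],
}

def generate(symbol="S", rng=random.Random(0), depth=0, max_depth=8):
    if symbol not in GRAMMAR:          # terminal symbol
        return symbol
    rules = GRAMMAR[symbol]
    # Near the depth limit, always take the first (shortest) rule so that
    # generation is guaranteed to terminate.
    rule = rules[0] if depth >= max_depth else rng.choice(rules)
    return "".join(generate(s, rng, depth + 1, max_depth) for s in rule)

print(" ".join(generate() for _ in range(3)))
```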
arXiv Detail & Related papers (2023-05-23T04:28:16Z) - Causally Disentangled Generative Variational AutoEncoder [16.82544099843568]
We present a new supervised learning technique for the Variational AutoEncoder (VAE).
This technique allows it to learn a causally disentangled representation and generate causally disentangled outcomes simultaneously.
We call this approach Causally Disentangled Generation (CDG).
arXiv Detail & Related papers (2023-02-23T01:57:09Z) - Debiased Fine-Tuning for Vision-language Models by Prompt Regularization [50.41984119504716]
We present a new paradigm for fine-tuning large-scale vision pre-trained models on downstream tasks, dubbed Prompt Regularization (ProReg).
ProReg uses the predictions obtained by prompting the pretrained model to regularize fine-tuning.
We show the consistently strong performance of ProReg compared with conventional fine-tuning, zero-shot prompting, prompt tuning, and other state-of-the-art methods.
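A minimal sketch of the idea as summarized above: regularize fine-tuning by pulling the fine-tuned model's predictions toward the frozen pretrained model's prompted (zero-shot) predictions. The exact ProReg objective, KL direction, and weighting are assumptions here, not the paper's formulation.

```python
# Hedged sketch of a prompt-regularized loss: task cross-entropy plus a KL
# pull toward the frozen, prompted model's prediction on the same input.
import numpy as np

def cross_entropy(p_model, label):
    return -np.log(p_model[label] + 1e-12)

def kl(p, q):
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))

def proreg_style_loss(p_finetuned, p_prompted_frozen, label, lam=0.5):
    # Ground-truth task loss plus a regularizer toward the zero-shot prompted
    # prediction, acting as a debiasing anchor during fine-tuning.
    return cross_entropy(p_finetuned, label) + lam * kl(p_prompted_frozen, p_finetuned)

# Toy usage with three classes:
p_ft = np.array([0.70, 0.20, 0.10])      # fine-tuned model's prediction
p_zs = np.array([0.40, 0.35, 0.25])      # frozen model prompted zero-shot
print(round(proreg_style_loss(p_ft, p_zs, label=0), 3))
```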
arXiv Detail & Related papers (2023-01-29T11:53:55Z) - Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task [75.35278593566068]
Language models show a surprising range of capabilities, but the source of their apparent competence is unclear.
Do these networks just memorize a collection of surface statistics, or do they rely on internal representations of the process that generates the sequences they see?
We investigate this question by applying a variant of the GPT model to the task of predicting legal moves in a simple board game, Othello.
arXiv Detail & Related papers (2022-10-24T16:29:55Z) - Scoring Rules for Performative Binary Prediction [2.111790330664657]
We show through theoretical and numerical results that proper scoring rules can incentivize experts to manipulate the world with their predictions.
We also construct a simple class of scoring rules that avoids this problem.
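A small numerical illustration of the performative setting described above, with a made-up response function: when the report itself shifts the outcome probability, the Brier-score-minimizing report need not match the probability it induces.

```python
# Hedged illustration: the reported forecast q shifts the true outcome
# probability p(q), so an expert scored with the (proper) Brier score may
# prefer a report that differs from the probability their report induces.
# The response function below is made up for this toy example.
import numpy as np

def induced_probability(q):
    # Hypothetical "performative" response: the world partly follows the forecast.
    return 0.4 + 0.3 * q

def expected_brier(q):
    p = induced_probability(q)
    return p * (1 - q) ** 2 + (1 - p) * q ** 2   # expected squared error

qs = np.linspace(0, 1, 1001)
best_q = qs[np.argmin([expected_brier(q) for q in qs])]
print(f"score-minimizing report: {best_q:.2f}, "
      f"probability it induces: {induced_probability(best_q):.2f}")
# The two numbers differ: the optimal report is not a fixed point of the
# response function, so the scoring rule rewards nudging the world.
```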
arXiv Detail & Related papers (2022-07-05T08:31:24Z) - Toward Certified Robustness Against Real-World Distribution Shifts [65.66374339500025]
We train a generative model to learn perturbations from data and define specifications with respect to the output of the learned model.
A unique challenge arising from this setting is that existing verifiers cannot tightly approximate sigmoid activations.
We propose a general meta-algorithm for handling sigmoid activations which leverages classical notions of counter-example-guided abstraction refinement.
arXiv Detail & Related papers (2022-06-08T04:09:13Z) - How Should Pre-Trained Language Models Be Fine-Tuned Towards Adversarial
Robustness? [121.57551065856164]
We propose Robust Informative Fine-Tuning (RIFT) as a novel adversarial fine-tuning method from an information-theoretical perspective.
RIFT encourages an objective model to retain the features learned from the pre-trained model throughout the entire fine-tuning process.
Experimental results show that RIFT consistently outperforms state-of-the-art methods on two popular NLP tasks.
arXiv Detail & Related papers (2021-12-22T05:04:41Z) - Delayed Propagation Transformer: A Universal Computation Engine towards
Practical Control in Cyber-Physical Systems [68.75717332928205]
Multi-agent control is a central theme in Cyber-Physical Systems (CPS).
This paper presents a new transformer-based model that specializes in the global modeling of CPS.
With physical constraint inductive bias baked into its design, our DePT is ready to plug and play for a broad class of multi-agent systems.
arXiv Detail & Related papers (2021-10-29T17:20:53Z) - Kronecker Decomposition for GPT Compression [8.60086973058282]
GPT is an auto-regressive Transformer-based pre-trained language model which has attracted a lot of attention in the natural language processing (NLP) domain.
Despite its superior performance, GPT can be prohibitively expensive to deploy on devices with limited computational power or memory.
In this work, we use Kronecker decomposition to compress the linear mappings of the GPT-2 model.
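A minimal sketch of the core compression step, approximating a weight matrix by a single Kronecker product via the rearrangement-plus-SVD trick (Van Loan's nearest-Kronecker-product construction); the shapes and toy matrix are assumptions, and the paper's GPT-2 compression pipeline involves more than this.

```python
# Hedged sketch: approximate a large weight matrix W by A ⊗ B, which stores
# far fewer parameters than W itself.
import numpy as np

def nearest_kronecker(W, a_shape, b_shape):
    """Find A (a_shape) and B (b_shape) minimizing ||W - kron(A, B)||_F."""
    m1, n1 = a_shape
    m2, n2 = b_shape
    assert W.shape == (m1 * m2, n1 * n2)
    # Rearrange W so the best Kronecker factors come from its top singular pair.
    R = W.reshape(m1, m2, n1, n2).transpose(0, 2, 1, 3).reshape(m1 * n1, m2 * n2)
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    A = np.sqrt(s[0]) * U[:, 0].reshape(m1, n1)
    B = np.sqrt(s[0]) * Vt[0].reshape(m2, n2)
    return A, B

# Toy usage: compress a 64x64 "weight matrix" into two 8x8 factors
# (128 parameters instead of 4096).
rng = np.random.default_rng(0)
W = np.kron(rng.normal(size=(8, 8)), rng.normal(size=(8, 8))) \
    + 0.01 * rng.normal(size=(64, 64))
A, B = nearest_kronecker(W, (8, 8), (8, 8))
print(np.linalg.norm(W - np.kron(A, B)) / np.linalg.norm(W))  # small relative error
```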
arXiv Detail & Related papers (2021-10-15T15:28:39Z) - CC-Cert: A Probabilistic Approach to Certify General Robustness of
Neural Networks [58.29502185344086]
In safety-critical machine learning applications, it is crucial to defend models against adversarial attacks.
It is important to provide provable guarantees for deep learning models against semantically meaningful input transformations.
We propose a new universal probabilistic certification approach based on Chernoff-Cramer bounds.
arXiv Detail & Related papers (2021-09-22T12:46:04Z) - Adversarial Example Games [51.92698856933169]
Adversarial Example Games (AEG) is a framework that models the crafting of adversarial examples.
AEG provides a new way to design adversarial examples by adversarially training a generator and an adversary from a given hypothesis class.
We demonstrate the efficacy of AEG on the MNIST and CIFAR-10 datasets.
arXiv Detail & Related papers (2020-07-01T19:47:23Z)