Meanings and Feelings of Large Language Models: Observability of Latent States in Generative AI
- URL: http://arxiv.org/abs/2405.14061v1
- Date: Wed, 22 May 2024 23:18:58 GMT
- Title: Meanings and Feelings of Large Language Models: Observability of Latent States in Generative AI
- Authors: Tian Yu Liu, Stefano Soatto, Matteo Marchi, Pratik Chaudhari, Paulo Tabuada
- Abstract summary: We show that current Large Language Models (LLMs) cannot have 'feelings' according to the American Psychological Association (APA) definition of the term.
Our analysis sheds light on possible designs that would enable a model to perform non-trivial computation that is not visible to the user.
- Score: 65.04274914674771
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We tackle the question of whether Large Language Models (LLMs), viewed as dynamical systems with state evolving in the embedding space of symbolic tokens, are observable. That is, whether there exist multiple 'mental' state trajectories that yield the same sequence of generated tokens, or sequences that belong to the same Nerode equivalence class ('meaning'). If not observable, mental state trajectories ('experiences') evoked by an input ('perception') or by feedback from the model's own state ('thoughts') could remain self-contained and evolve unbeknown to the user while being potentially accessible to the model provider. Such "self-contained experiences evoked by perception or thought" are akin to what the American Psychological Association (APA) defines as 'feelings'. Beyond the lexical curiosity, we show that current LLMs implemented by autoregressive Transformers cannot have 'feelings' according to this definition: The set of state trajectories indistinguishable from the tokenized output is a singleton. But if there are 'system prompts' not visible to the user, then the set of indistinguishable trajectories becomes non-trivial, and there can be multiple state trajectories that yield the same verbalized output. We prove these claims analytically, and show examples of modifications to standard LLMs that engender such 'feelings.' Our analysis sheds light on possible designs that would enable a model to perform non-trivial computation that is not visible to the user, as well as on controls that the provider of services using the model could take to prevent unintended behavior.
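For intuition only, the toy sketch below (all names and the decoding rule are illustrative assumptions, not the paper's construction) treats the model as a dynamical system whose state is the token prefix, with greedy decoding as the state transition and the user-visible tokens as the output. With a hidden system prompt prepended, two distinct latent trajectories can emit the same verbalized output:

```python
# Toy sketch: an autoregressive "LM" as a dynamical system.  State = token
# prefix (including any hidden system prompt); transition = greedy decoding;
# output = the tokens shown to the user.  All names are illustrative.

def greedy_step(state: tuple) -> str:
    """Toy stand-in for argmax decoding: looks only at the last token."""
    table = {"hello": "world", "world": "!", "!": "<eos>"}
    return table.get(state[-1], "<eos>")

def generate(hidden_prompt, user_prompt, max_steps: int = 8):
    state = tuple(hidden_prompt) + tuple(user_prompt)   # latent ("mental") state
    visible = []                                        # what the user observes
    for _ in range(max_steps):
        tok = greedy_step(state)
        if tok == "<eos>":
            break
        state += (tok,)                                 # state transition
        visible.append(tok)                             # output map
    return state, visible

# Two different hidden system prompts, same user prompt:
state_a, out_a = generate(("You", "are", "terse"), ("hello",))
state_b, out_b = generate(("You", "are", "verbose"), ("hello",))

assert out_a == out_b        # identical verbalized output...
assert state_a != state_b    # ...but distinct latent trajectories
print(out_a)                 # ['world', '!']
```

Without the hidden prefix, the visible output in this toy determines the state trajectory uniquely, mirroring the paper's claim that the set of indistinguishable trajectories is a singleton for standard autoregressive decoding.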
Related papers
- Identifiable Steering via Sparse Autoencoding of Multi-Concept Shifts [11.81523319216474]
Steering methods manipulate the representations of large language models (LLMs) to induce responses that have desired properties.
Traditionally, steering has relied on supervision, such as from contrastive pairs of prompts that vary in a single target concept.
We introduce Sparse Shift Autoencoders (SSAEs) that instead map the differences between embeddings to sparse representations.
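A minimal sketch of that idea in PyTorch (hypothetical code, not the paper's implementation): encode the difference between two prompt embeddings into a non-negative code with an L1 sparsity penalty, and decode it back to the shift.

```python
# Hypothetical Sparse Shift Autoencoder sketch (illustrative, not the paper's code).
import torch
import torch.nn as nn

class SparseShiftAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_code: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_code)
        self.decoder = nn.Linear(d_code, d_model)

    def forward(self, emb_a: torch.Tensor, emb_b: torch.Tensor):
        shift = emb_b - emb_a                     # difference between embeddings
        code = torch.relu(self.encoder(shift))    # non-negative, encouraged to be sparse
        return self.decoder(code), code

model = SparseShiftAutoencoder(d_model=768, d_code=4096)
emb_a, emb_b = torch.randn(8, 768), torch.randn(8, 768)
recon, code = model(emb_a, emb_b)
loss = ((recon - (emb_b - emb_a)) ** 2).mean() + 1e-3 * code.abs().mean()  # recon + L1
loss.backward()
```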
arXiv Detail & Related papers (2025-02-14T08:49:41Z)
- Predicting the Performance of Black-box LLMs through Self-Queries [60.87193950962585]
As large language models (LLMs) are increasingly relied on in AI systems, predicting when they make mistakes is crucial.
In this paper, we extract features of LLMs in a black-box manner by using follow-up prompts and taking the probabilities of different responses as representations.
We demonstrate that training a linear model on these low-dimensional representations produces reliable predictors of model performance at the instance level.
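A hedged sketch of that recipe (the `query_model` interface below is an assumption, not the paper's API): collect the probabilities the model assigns to candidate answers of a few follow-up prompts, then fit a linear probe that predicts whether the original answer was correct.

```python
# Sketch of black-box feature extraction via follow-up prompts (illustrative).
import numpy as np
from sklearn.linear_model import LogisticRegression

FOLLOW_UPS = [
    "Are you confident in your answer? Answer yes or no.",
    "Would you change your answer if asked again? Answer yes or no.",
]

def response_features(query_model, prompt: str) -> np.ndarray:
    """query_model(text, options) -> probability per option (assumed interface)."""
    feats = []
    for q in FOLLOW_UPS:
        feats.extend(query_model(prompt + "\n" + q, options=["yes", "no"]))
    return np.array(feats)

# Given labelled prompts and correctness flags:
# X = np.stack([response_features(query_model, p) for p in prompts])
# y = np.array(correct_flags)
# probe = LogisticRegression().fit(X, y)   # instance-level error predictor
```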
arXiv Detail & Related papers (2025-01-02T22:26:54Z)
- Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct [0.0]
We find that the Llama3-8b-Instruct chat model can reliably distinguish its own outputs from those of humans.
We identify a vector in the residual stream of the model that is differentially activated when the model makes a correct self-written-text recognition judgment.
We show that the vector can be used to control both the model's behavior and its perception.
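A generic sketch of how such a vector can be applied (a common activation-steering recipe, not the paper's exact code; the hooked layer and its output format are model-specific assumptions):

```python
# Add a scaled direction to a layer's hidden states during the forward pass.
import torch

def make_steering_hook(direction: torch.Tensor, alpha: float):
    direction = direction / direction.norm()
    def hook(module, inputs, output):
        # Assumes the hooked module returns the hidden-state tensor directly,
        # shaped (batch, seq, d_model); shifts every position along `direction`.
        return output + alpha * direction
    return hook

# Hypothetical usage (the layer path depends on the model implementation):
# handle = model.model.layers[20].register_forward_hook(make_steering_hook(v, alpha=8.0))
# ... generate with the hooked model, then handle.remove()
```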
arXiv Detail & Related papers (2024-10-02T22:26:21Z)
- States Hidden in Hidden States: LLMs Emerge Discrete State Representations Implicitly [72.24742240125369]
In this paper, we uncover LLMs' intrinsic ability to perform extended sequences of calculations without relying on chain-of-thought step-by-step solutions.
Remarkably, the most advanced models can directly output the results of two-digit number additions with lengths extending up to 15 addends.
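For concreteness, a probe of this kind (my construction, not the paper's benchmark code) simply concatenates many addends and asks for the sum with no intermediate steps:

```python
# Build a multi-addend addition prompt and its reference answer (illustrative).
import random

def make_addition_prompt(n_addends: int, seed: int = 0):
    rng = random.Random(seed)
    terms = [rng.randint(10, 99) for _ in range(n_addends)]
    return " + ".join(map(str, terms)) + " = ", sum(terms)

prompt, answer = make_addition_prompt(15)
# Feed `prompt` to a model instructed to answer directly (no chain of thought)
# and compare its completion against `answer`.
print(prompt, answer)
```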
arXiv Detail & Related papers (2024-07-16T06:27:22Z)
- Meaning Representations from Trajectories in Autoregressive Models [106.63181745054571]
We propose to extract meaning representations from autoregressive language models by considering the distribution of all possible trajectories extending an input text.
This strategy is prompt-free, does not require fine-tuning, and is applicable to any pre-trained autoregressive model.
We empirically show that the representations obtained from large models align well with human annotations, outperform other zero-shot and prompt-free methods on semantic similarity tasks, and can be used to solve more complex entailment and containment tasks that standard embeddings cannot handle.
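A simplified sketch of the trajectory idea (assuming a Hugging Face causal LM `model` and tokenizer `tok` are already loaded; this is a rough symmetrized estimator, not the paper's exact score): sample continuations of each text and measure how likely the other text is to produce them.

```python
# Rough trajectory-based similarity between two texts (illustrative sketch).
import torch

def trajectory_similarity(model, tok, text_a: str, text_b: str,
                          n: int = 8, max_new: int = 32):
    def sample(text):
        ids = tok(text, return_tensors="pt").input_ids
        outs = model.generate(ids, do_sample=True, num_return_sequences=n,
                              max_new_tokens=max_new, pad_token_id=tok.eos_token_id)
        return [o[ids.shape[1]:] for o in outs]              # continuations only

    def logprob(prefix, cont):
        ids = tok(prefix, return_tensors="pt").input_ids
        full = torch.cat([ids, cont.unsqueeze(0)], dim=1)
        logp = torch.log_softmax(model(full).logits[:, :-1], dim=-1)
        token_lp = logp.gather(-1, full[:, 1:].unsqueeze(-1)).squeeze(-1)
        return token_lp[:, ids.shape[1] - 1:].sum()          # log p(cont | prefix)

    with torch.no_grad():
        a_under_b = torch.stack([logprob(text_b, c) for c in sample(text_a)]).mean()
        b_under_a = torch.stack([logprob(text_a, c) for c in sample(text_b)]).mean()
    return 0.5 * (a_under_b + b_under_a)
```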
arXiv Detail & Related papers (2023-10-23T04:35:58Z)
- Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers [93.9369467909176]
We explain language models as meta-optimizers and understand in-context learning as implicit finetuning.
We show that in-context learning behaves similarly to explicit finetuning from multiple perspectives.
A momentum-based attention, designed by analogy with gradient descent with momentum, improves over vanilla attention, further supporting this understanding from another perspective.
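Under the linear-attention simplification used in that analysis (a paraphrase of the idea, not the paper's exact derivation or notation), attending to demonstration tokens X' adds an outer-product term to the zero-shot computation, which has the same form as a gradient-descent weight update:

```latex
\[
F(q) \;=\; W_V\,[X';\,x_q]\,\bigl(W_K\,[X';\,x_q]\bigr)^{\top} q
      \;=\; \Bigl(\underbrace{W_V\,x_q\,(W_K\,x_q)^{\top}}_{\text{zero-shot part}}
      \;+\; \underbrace{W_V\,X'\,(W_K\,X')^{\top}}_{\Delta W_{\mathrm{ICL}}}\Bigr) q,
\qquad
\Delta W_{\mathrm{GD}} \;=\; \sum_i e_i\, x_i^{\top}.
\]
```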
arXiv Detail & Related papers (2022-12-20T18:58:48Z)
- Provable Limitations of Acquiring Meaning from Ungrounded Form: What Will Future Language Models Understand? [87.20342701232869]
We investigate the abilities of ungrounded systems to acquire meaning.
We study whether assertions enable a system to emulate representations preserving semantic relations like equivalence.
We find that assertions enable semantic emulation if all expressions in the language are referentially transparent.
However, if the language uses non-transparent patterns like variable binding, we show that emulation can become an uncomputable problem.
arXiv Detail & Related papers (2021-04-22T01:00:17Z)