Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct
- URL: http://arxiv.org/abs/2410.02064v1
- Date: Wed, 2 Oct 2024 22:26:21 GMT
- Title: Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct
- Authors: Christopher Ackerman, Nina Panickssery
- Abstract summary: We find that the Llama3-8b-Instruct chat model can reliably distinguish its own outputs from those of humans.
We identify a vector in the residual stream of the model that is differentially activated when the model makes a correct self-written-text recognition judgment.
We show that the vector can be used to control both the model's behavior and its perception.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: It has been reported that LLMs can recognize their own writing. As this has potential implications for AI safety, yet is relatively understudied, we investigate the phenomenon, seeking to establish whether it robustly occurs at the behavioral level, how the observed behavior is achieved, and whether it can be controlled. First, we find that the Llama3-8b-Instruct chat model - but not the base Llama3-8b model - can reliably distinguish its own outputs from those of humans, and present evidence that the chat model is likely using its experience with its own outputs, acquired during post-training, to succeed at the writing recognition task. Second, we identify a vector in the residual stream of the model that is differentially activated when the model makes a correct self-written-text recognition judgment, show that the vector activates in response to information relevant to self-authorship, present evidence that the vector is related to the concept of "self" in the model, and demonstrate that the vector is causally related to the model's ability to perceive and assert self-authorship. Finally, we show that the vector can be used to control both the model's behavior and its perception, steering the model to claim or disclaim authorship by applying the vector to the model's output as it generates it, and steering the model to believe or disbelieve it wrote arbitrary texts by applying the vector to them as the model reads them.
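Below is a minimal sketch of the general difference-of-means steering recipe the abstract describes, written with PyTorch and Hugging Face transformers. The layer index, steering coefficient, placeholder contrast texts, and the omission of the chat template are illustrative assumptions, not the paper's exact procedure.
```python
# Sketch only: extract a "self-authorship" direction from the residual stream
# and add it back during generation. Layer, coefficient, and texts are assumed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

LAYER = 16  # assumed middle-layer injection point

def residual_at_layer(text: str) -> torch.Tensor:
    """Residual-stream activation after block LAYER, at the final token."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER + 1][0, -1, :].float()

# Contrast sets (placeholders): texts the model generated vs. texts humans wrote.
self_texts = ["<model-generated text 1>", "<model-generated text 2>"]
human_texts = ["<human-written text 1>", "<human-written text 2>"]

self_mean = torch.stack([residual_at_layer(t) for t in self_texts]).mean(0)
human_mean = torch.stack([residual_at_layer(t) for t in human_texts]).mean(0)
steer = self_mean - human_mean
steer = steer / steer.norm()

COEFF = 4.0  # assumed steering strength; a negative sign reverses the effect

def add_steering(module, args, output):
    # Decoder blocks return a tuple whose first element is the hidden states.
    if isinstance(output, tuple):
        return (output[0] + COEFF * steer.to(output[0].dtype),) + output[1:]
    return output + COEFF * steer.to(output.dtype)

handle = model.model.layers[LAYER].register_forward_hook(add_steering)
prompt = "Did you write the following text? <text to judge>"
ids = tok(prompt, return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=64)[0]))
handle.remove()
```
The same vector can be added to the activations of a text the model is reading, rather than writing, to shift its judgment of authorship, which mirrors the perception-steering result described in the abstract.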
Related papers
- Activation Scaling for Steering and Interpreting Language Models [55.59689963561315]
We argue that successfully intervening on a model is a prerequisite for interpreting its internal workings.
We establish a three-term objective: a successful intervention should flip the correct token with the wrong one, and vice versa.
Using gradient-based optimization, this objective lets us learn (and later evaluate) a specific kind of efficient and interpretable intervention.
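A toy sketch of the gradient-learned activation-scaling idea follows: one learnable scalar per transformer block scales that block's output, trained to flip the model's preference between two candidate tokens while an L1 term keeps the intervention sparse. GPT-2, the prompt, and the loss weights are illustrative assumptions, not the paper's exact three-term objective.
```python
# Learn per-block scaling factors that flip one token prediction, sparsely.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()
for p in model.parameters():
    p.requires_grad_(False)

alphas = torch.nn.Parameter(torch.ones(len(model.transformer.h)))

def make_hook(i):
    def scale_output(module, args, output):
        return (alphas[i] * output[0],) + output[1:]
    return scale_output

handles = [blk.register_forward_hook(make_hook(i))
           for i, blk in enumerate(model.transformer.h)]

ids = tok("The capital of France is", return_tensors="pt").input_ids
correct = tok(" Paris", add_special_tokens=False).input_ids[0]
wrong = tok(" London", add_special_tokens=False).input_ids[0]

opt = torch.optim.Adam([alphas], lr=0.05)
for step in range(100):
    logits = model(ids).logits[0, -1]
    flip = logits[correct] - logits[wrong]      # push below zero to flip the prediction
    sparsity = (alphas - 1.0).abs().sum()       # prefer touching as few blocks as possible
    loss = flip + 0.1 * sparsity
    opt.zero_grad()
    loss.backward()
    opt.step()

for h in handles:
    h.remove()
print("learned per-block scaling factors:", alphas.data)
```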
arXiv Detail & Related papers (2024-10-07T12:01:32Z) - Representation Tuning [0.0]
Activation engineering is becoming increasingly popular as a means of online control of large language models.
I extend the idea of active steering with vectors that represent a behavioral direction of interest by tuning those vectors directly into the model, removing the need for online control.
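A hedged sketch of what tuning a behavioral direction into the weights could look like: alongside the standard language-modeling loss, an auxiliary term pulls one layer's hidden states toward a fixed direction, so no inference-time vector is needed. The layer, loss weight, random direction, and placeholder data are assumptions, not the paper's setup.
```python
# Fine-tune so the steering effect is baked into the weights.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

LAYER, ALIGN_WEIGHT = 6, 1.0
direction = torch.randn(model.config.hidden_size)   # placeholder behavioral direction
direction = direction / direction.norm()

opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
texts = ["<training example 1>", "<training example 2>"]  # placeholder data

for text in texts:
    batch = tok(text, return_tensors="pt")
    out = model(**batch, labels=batch["input_ids"], output_hidden_states=True)
    hidden = out.hidden_states[LAYER]                # (1, seq_len, hidden_size)
    align = 1 - F.cosine_similarity(hidden, direction.expand_as(hidden), dim=-1).mean()
    loss = out.loss + ALIGN_WEIGHT * align           # LM loss + representation alignment
    opt.zero_grad()
    loss.backward()
    opt.step()
```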
arXiv Detail & Related papers (2024-09-11T00:56:02Z) - Meanings and Feelings of Large Language Models: Observability of Latent States in Generative AI [65.04274914674771]
We show that current Large Language Models (LLMs) cannot have 'feelings' according to the American Psychological Association's (APA) definition.
Our analysis sheds light on possible designs that would enable a model to perform non-trivial computation that is not visible to the user.
arXiv Detail & Related papers (2024-05-22T23:18:58Z) - MOVE: Effective and Harmless Ownership Verification via Embedded External Features [109.19238806106426]
We propose an effective and harmless model ownership verification (MOVE) to defend against different types of model stealing simultaneously.
We conduct the ownership verification by verifying whether a suspicious model contains the knowledge of defender-specified external features.
In particular, we develop our MOVE method under both white-box and black-box settings to provide comprehensive model protection.
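A greatly simplified sketch of the external-feature idea, not MOVE itself: a fixed transform stands in for the defender's embedded "external features", and a paired t-test (instead of the paper's meta-classifier) checks whether a suspicious model reacts to that transform more strongly than an independently trained model does.
```python
# Simplified stand-in for external-feature ownership verification.
import torch
from scipy import stats

def external_feature(x: torch.Tensor) -> torch.Tensor:
    """Placeholder 'style' transform: blend in a fixed pattern."""
    g = torch.Generator().manual_seed(0)
    pattern = torch.rand(x.shape[1:], generator=g)
    return (0.9 * x + 0.1 * pattern).clamp(0, 1)

def reaction(model: torch.nn.Module, probes: torch.Tensor) -> torch.Tensor:
    """Magnitude of the output shift caused by applying the external feature."""
    with torch.no_grad():
        return (model(external_feature(probes)) - model(probes)).norm(dim=-1)

def verify_ownership(suspect, independent, probes, alpha=0.05) -> bool:
    """Flag the suspect if it reacts significantly more than an independent model."""
    a = reaction(suspect, probes).cpu().numpy()
    b = reaction(independent, probes).cpu().numpy()
    _, p_value = stats.ttest_rel(a, b, alternative="greater")
    return p_value < alpha

# Toy usage with flattened probe "images" and dummy linear classifiers.
probes = torch.rand(64, 3 * 32 * 32)
print(verify_ownership(torch.nn.Linear(3072, 10), torch.nn.Linear(3072, 10), probes))
```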
arXiv Detail & Related papers (2022-08-04T02:22:29Z) - Extracting Latent Steering Vectors from Pretrained Language Models [14.77762401765532]
We show that latent vectors can be extracted directly from language model decoders without fine-tuning.
Experiments show that there exist steering vectors, which, when added to the hidden states of the language model, generate a target sentence nearly perfectly.
We find that distances between steering vectors reflect sentence similarity when evaluated on a textual similarity benchmark.
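A hedged sketch of extracting such a latent steering vector: the decoder is frozen and a single vector, added to one layer's hidden states, is optimized until the model assigns high probability to a chosen target sentence. GPT-2, the injection layer, and the optimizer settings are illustrative assumptions.
```python
# Optimize a single vector so the frozen model reproduces a target sentence.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()
for p in model.parameters():
    p.requires_grad_(False)

LAYER = 4                                            # assumed injection point
z = torch.nn.Parameter(torch.zeros(model.config.hidden_size))

def inject(module, args, output):
    return (output[0] + z,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(inject)

target = tok("I love hiking in the mountains.", return_tensors="pt")
opt = torch.optim.Adam([z], lr=0.1)
for step in range(200):
    loss = model(**target, labels=target["input_ids"]).loss   # NLL of the target sentence
    opt.zero_grad()
    loss.backward()
    opt.step()

handle.remove()
print("steering vector norm:", z.detach().norm().item())
```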
arXiv Detail & Related papers (2022-05-10T19:04:37Z) - Controlling the Focus of Pretrained Language Generation Models [22.251710018744497]
We develop a control mechanism by which a user can select spans of context as "highlights" for the model to focus on, and generate relevant output.
To achieve this goal, we augment a pretrained model with trainable "focus vectors" that are directly applied to the model's embeddings.
Our experiments show that the trained focus vectors are effective in steering the model to generate outputs that are relevant to user-selected highlights.
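A hedged sketch of the focus-vector idea: trainable offsets are added to the embeddings of user-highlighted tokens (and the rest) before a frozen pretrained model consumes them via `inputs_embeds`. The two-offset setup, training target, and hyperparameters are illustrative assumptions.
```python
# Trainable embedding offsets that mark a user-selected highlight span.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
for p in model.parameters():
    p.requires_grad_(False)

hidden_size = model.config.hidden_size
focus_vec = torch.nn.Parameter(torch.zeros(hidden_size))   # added to highlighted tokens
other_vec = torch.nn.Parameter(torch.zeros(hidden_size))   # added to the remaining tokens

def embed_with_focus(input_ids, highlight_mask):
    base = model.transformer.wte(input_ids)                 # (1, seq_len, hidden_size)
    mask = highlight_mask.unsqueeze(-1).float()
    return base + mask * focus_vec + (1 - mask) * other_vec

batch = tok("The hotel was noisy but the breakfast was excellent.", return_tensors="pt")
highlight = torch.zeros_like(batch["input_ids"])
highlight[0, -5:] = 1                                       # roughly mark the span to focus on

opt = torch.optim.Adam([focus_vec, other_vec], lr=1e-3)
out = model(inputs_embeds=embed_with_focus(batch["input_ids"], highlight),
            labels=batch["input_ids"])                      # placeholder training target
opt.zero_grad()
out.loss.backward()
opt.step()
```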
arXiv Detail & Related papers (2022-03-02T14:46:14Z) - Explain, Edit, and Understand: Rethinking User Study Design for Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z) - Reason induced visual attention for explainable autonomous driving [2.090380922731455]
Deep learning (DL) based computer vision (CV) models are generally considered black boxes due to poor interpretability.
This study is motivated by the need to enhance the interpretability of DL models in autonomous driving.
The proposed framework imitates the learning process of human drivers by jointly modeling the visual input (images) and natural language.
arXiv Detail & Related papers (2021-10-11T18:50:41Z) - Improving Aspect-based Sentiment Analysis with Gated Graph Convolutional Networks and Syntax-based Regulation [89.38054401427173]
Aspect-based Sentiment Analysis (ABSA) seeks to predict the sentiment polarity of a sentence toward a specific aspect.
Dependency trees can be integrated into deep learning models to produce state-of-the-art performance for ABSA, but existing approaches leave open issues that limit their effectiveness.
We propose a novel graph-based deep learning model to overcome these issues.
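A simplified sketch of a gated graph convolution over a dependency graph for ABSA: dependency neighbors are aggregated, and an aspect-conditioned gate regulates what flows through. This is a single illustrative layer under assumed shapes, not the paper's full architecture or syntax-based regularization.
```python
# One gated graph-convolution layer over a dependency adjacency matrix.
import torch
import torch.nn as nn

class GatedGraphConv(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.neighbor = nn.Linear(dim, dim)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, h, adj, aspect):
        # h: (seq, dim) token states; adj: (seq, seq) dependency adjacency; aspect: (dim,)
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1)
        agg = adj @ self.neighbor(h) / deg                  # mean over dependency neighbors
        gate_in = torch.cat([h, aspect.expand_as(h)], dim=-1)
        g = torch.sigmoid(self.gate(gate_in))               # aspect-conditioned gate
        return torch.relu(h + g * agg)

# Toy usage: 5 tokens, 16-dim states, a chain-shaped dependency graph.
seq, dim = 5, 16
h = torch.randn(seq, dim)
adj = torch.eye(seq) + torch.diag(torch.ones(seq - 1), 1) + torch.diag(torch.ones(seq - 1), -1)
aspect = h[2]                                               # pretend token 2 is the aspect term
layer = GatedGraphConv(dim)
print(layer(h, adj, aspect).shape)                          # torch.Size([5, 16])
```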
arXiv Detail & Related papers (2020-10-26T07:36:24Z) - A Controllable Model of Grounded Response Generation [122.7121624884747]
Current end-to-end neural conversation models inherently lack the flexibility to impose semantic control in the response generation process.
We propose a framework that we call controllable grounded response generation (CGRG).
We show that using this framework, a transformer-based model with a novel inductive attention mechanism, trained on a conversation-like Reddit dataset, outperforms strong generation baselines.
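A hedged sketch of the kind of input construction CGRG implies: dialogue context, user-chosen control phrases, and grounding snippets are joined with separators and handed to a generator. The separator strings and the GPT-2 stand-in are assumptions, and the paper's inductive attention mechanism is not reproduced.
```python
# Build a context + control-phrase + grounding prompt and generate a response.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def build_cgrg_prompt(context, control_phrases, grounding):
    return (" <|context|> ".join(context)
            + " <|control|> " + " ; ".join(control_phrases)
            + " <|grounding|> " + " ".join(grounding)
            + " <|response|>")

prompt = build_cgrg_prompt(
    context=["Any good sci-fi movies lately?"],
    control_phrases=["Dune", "Denis Villeneuve"],
    grounding=["Dune is a 2021 science fiction film directed by Denis Villeneuve."],
)
ids = tok(prompt, return_tensors="pt")
output = model.generate(**ids, max_new_tokens=40, pad_token_id=tok.eos_token_id)
print(tok.decode(output[0]))
```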
arXiv Detail & Related papers (2020-05-01T21:22:08Z)