Related papers: Representation Tuning

URL: http://arxiv.org/abs/2409.06927v3
Date: Wed, 9 Oct 2024 13:39:27 GMT
Title: Representation Tuning
Authors: Christopher M. Ackerman
Abstract summary: Activation engineering is becoming increasingly popular as a means of online control of large language models. I extend the idea of active steering with vectors that represent a behavioral direction of interest to tuning those vectors directly into the model.
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Activation engineering is becoming increasingly popular as a means of online control of large language models (LLMs). In this work, I extend the idea of active steering with vectors that represent a behavioral direction of interest to tuning those vectors directly into the model, obviating the need for online control. First, I identify activation vectors related to honesty in an open-source LLM (Llama-2-13b-chat). Next, I demonstrate that model output can be made more or less honest by adding positive or negative multiples of these vectors to residual stream activations during generation. Then, I show that a similar effect can be achieved by fine-tuning the vectors directly into the model, by use of a dual loss function based on the cosine similarity of residual stream activations to the vectors combined with a standard token-based loss ("representation tuning"). Finally, I compare the generations in response to honesty-probing prompts from the resulting models to those from models fine-tuned with a token-based loss alone, and to those from the untuned model subjected to online steering. Overall, fine-tuning the vectors into the models using the cosine similarity plus token loss showed a stronger effect than online steering, and generalized better than using the standard loss, suggesting the potential utility of this approach as a safety measure. Code and data are available at https://github.com/cma1114/representation_tuning; tuned models are available at https://huggingface.co/collections/cackerman/representation-tuning-66da1e5ab41cd1b824687d9f.
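
Illustrative sketch (not the author's code): the abstract describes two operations, adding a multiple of a behavioral vector to residual stream activations, and training with a loss that combines cosine similarity to that vector with a token-based loss. The pure-Python toy below shows one plausible reading of both. The function names, the `weight` parameter, and the `1 - cosine` form of the alignment term are assumptions made here for illustration; the abstract does not specify the exact formulation, which is in the linked repository.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def steer(activation, direction, alpha):
    """Online steering: add alpha times the behavioral direction to an activation.
    A positive alpha pushes toward the direction, a negative alpha away from it."""
    return [h + alpha * d for h, d in zip(activation, direction)]

def token_loss(probs, target_index):
    """Standard token-based loss: negative log-probability of the target token."""
    return -math.log(probs[target_index])

def dual_loss(activation, direction, probs, target_index, weight=1.0):
    """Assumed combined objective: token loss plus a penalty that shrinks
    as the activation aligns with the behavioral direction."""
    alignment_penalty = 1.0 - cosine_similarity(activation, direction)
    return token_loss(probs, target_index) + weight * alignment_penalty

# Toy values, hypothetical and two-dimensional for readability.
honesty_direction = [0.0, 1.0]
activation = [1.0, 0.0]
steered = steer(activation, honesty_direction, alpha=4.0)
```

In this toy setting, steering raises the cosine similarity between the activation and the direction, and the alignment term of `dual_loss` falls accordingly. A real implementation would operate on tensors at chosen layers and token positions rather than on a single short list.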

Related papers

Activation Scaling for Steering and Interpreting Language Models [55.59689963561315]
We argue that successfully intervening on a model is a prerequisite for interpreting its internal workings. We establish a three-term objective: a successful intervention should flip the correct with the wrong token and vice versa. Using gradient-based optimization, this objective lets us learn (and later evaluate) a specific kind of efficient and interpretable intervention.
arXiv Detail & Related papers (2024-10-07T12:01:32Z)
Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct [0.0]
We find that the Llama3-8b-Instruct chat model can reliably distinguish its own outputs from those of humans. We identify a vector in the residual stream of the model that is differentially activated when the model makes a correct self-written-text recognition judgment. We show that the vector can be used to control both the model's behavior and its perception.
arXiv Detail & Related papers (2024-10-02T22:26:21Z)
Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines [74.42485647685272]
We focus on Generative Masked Language Models (GMLMs). We train a model to fit conditional probabilities of the data distribution via masking, which are subsequently used as inputs to a Markov Chain to draw samples from the model. We adapt the T5 model for iteratively-refined parallel decoding, achieving 2-3x speedup in machine translation with minimal sacrifice in quality.
arXiv Detail & Related papers (2024-07-22T18:00:00Z)
Steering Without Side Effects: Improving Post-Deployment Control of Language Models [61.99293520621248]
Language models (LMs) have been shown to behave unexpectedly post-deployment. We present KL-then-steer (KTS), a technique that decreases the side effects of steering while retaining its benefits. Our best method prevents 44% of jailbreak attacks compared to the original Llama-2-chat-7B model.
arXiv Detail & Related papers (2024-06-21T01:37:39Z)
Autoregressive Image Generation without Vector Quantization [31.798754606008067]
Conventional wisdom holds that autoregressive models for image generation are typically accompanied by vector-quantized tokens. We propose to model the per-token probability distribution using a diffusion procedure, which allows us to apply autoregressive models in a continuous-valued space.
arXiv Detail & Related papers (2024-06-17T17:59:58Z)
Minusformer: Improving Time Series Forecasting by Progressively Learning Residuals [14.741951369068877]
We find that ubiquitous time series (TS) forecasting models are prone to severe overfitting. We introduce a dual-stream and subtraction mechanism, which is a deep Boosting ensemble learning method. The proposed method outperforms existing state-of-the-art methods, yielding an average performance improvement of 11.9% across various datasets.
arXiv Detail & Related papers (2024-02-04T03:54:31Z)
Meaning Representations from Trajectories in Autoregressive Models [106.63181745054571]
We propose to extract meaning representations from autoregressive language models by considering the distribution of all possible trajectories extending an input text. This strategy is prompt-free, does not require fine-tuning, and is applicable to any pre-trained autoregressive model. We empirically show that the representations obtained from large models align well with human annotations, outperform other zero-shot and prompt-free methods on semantic similarity tasks, and can be used to solve more complex entailment and containment tasks that standard embeddings cannot handle.
arXiv Detail & Related papers (2023-10-23T04:35:58Z)
Transformers meet Stochastic Block Models: Attention with Data-Adaptive Sparsity and Cost [53.746169882193456]
Recent works have proposed various sparse attention modules to overcome the quadratic cost of self-attention. We propose a model that resolves both problems by endowing each attention head with a mixed-membership Block Model. Our model outperforms previous efficient variants as well as the original Transformer with full attention.
arXiv Detail & Related papers (2022-10-27T15:30:52Z)
Extracting Latent Steering Vectors from Pretrained Language Models [14.77762401765532]
We show that latent vectors can be extracted directly from language model decoders without fine-tuning. Experiments show that there exist steering vectors, which, when added to the hidden states of the language model, generate a target sentence nearly perfectly. We find that distances between steering vectors reflect sentence similarity when evaluated on a textual similarity benchmark.
arXiv Detail & Related papers (2022-05-10T19:04:37Z)
Autoencoding Variational Autoencoder [56.05008520271406]
We study the implications of this behaviour on the learned representations and also the consequences of fixing it by introducing a notion of self consistency. We show that encoders trained with our self-consistency approach lead to representations that are robust (insensitive) to perturbations in the input introduced by adversarial attacks.
arXiv Detail & Related papers (2020-12-07T14:16:14Z)

This list is automatically generated from the titles and abstracts of the papers on this site.