Extracting Latent Steering Vectors from Pretrained Language Models
- URL: http://arxiv.org/abs/2205.05124v1
- Date: Tue, 10 May 2022 19:04:37 GMT
- Title: Extracting Latent Steering Vectors from Pretrained Language Models
- Authors: Nishant Subramani, Nivedita Suresh, Matthew E. Peters
- Abstract summary: We show that latent vectors can be extracted directly from language model decoders without fine-tuning.
Experiments show that there exist steering vectors, which, when added to the hidden states of the language model, generate a target sentence nearly perfectly.
We find that distances between steering vectors reflect sentence similarity when evaluated on a textual similarity benchmark.
- Score: 14.77762401765532
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Prior work on controllable text generation has focused on learning how to
control language models through trainable decoding, smart-prompt design, or
fine-tuning based on a desired objective. We hypothesize that the information
needed to steer the model to generate a target sentence is already encoded
within the model. Accordingly, we explore a different approach altogether:
extracting latent vectors directly from pretrained language model decoders
without fine-tuning. Experiments show that there exist steering vectors, which,
when added to the hidden states of the language model, generate a target
sentence nearly perfectly (> 99 BLEU) for English sentences from a variety of
domains. We show that vector arithmetic can be used for unsupervised sentiment
transfer on the Yelp sentiment benchmark, with performance comparable to models
tailored to this task. We find that distances between steering vectors reflect
sentence similarity when evaluated on a textual similarity benchmark (STS-B),
outperforming pooled hidden states of models. Finally, we present an analysis
of the intrinsic properties of the steering vectors. Taken together, our
results suggest that frozen LMs can be effectively controlled through their
latent steering space.
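To make the extraction procedure concrete, here is a minimal PyTorch sketch: a single additive vector is trained, with the language model frozen, to reconstruct a target sentence. The choice of GPT-2, the injection layer, and the optimizer settings are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch: learn a steering vector z that, added to one layer's
# hidden states of a frozen GPT-2, reconstructs a target sentence.
# Layer index, learning rate, and step count are illustrative.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
for p in model.parameters():
    p.requires_grad_(False)  # the LM stays frozen; only z is trained

ids = tok("The movie was a delightful surprise.", return_tensors="pt").input_ids
z = torch.zeros(model.config.n_embd, requires_grad=True)  # steering vector

def add_z(module, inputs, output):
    # inject z into the hidden states leaving one transformer block
    return (output[0] + z,) + output[1:]

hook = model.transformer.h[6].register_forward_hook(add_z)
opt = torch.optim.Adam([z], lr=1e-2)

for step in range(500):
    loss = model(input_ids=ids, labels=ids).loss  # teacher-forced reconstruction
    opt.zero_grad()
    loss.backward()
    opt.step()

hook.remove()
```

Once trained, re-registering the same hook and decoding from the model should reproduce the target sentence if the vector has converged.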
Related papers
- Activation Scaling for Steering and Interpreting Language Models [55.59689963561315]
We argue that successfully intervening on a model is a prerequisite for interpreting its internal workings.
We establish a three-term objective: a successful intervention should flip the correct with the wrong token and vice versa (effectiveness), leave other tokens unaffected (faithfulness), and remain sparse (minimality).
Using gradient-based optimization, this objective lets us learn (and later evaluate) a specific kind of efficient and interpretable intervention.
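As a toy illustration of such a three-term objective, the loss below scores an intervention by effectiveness, faithfulness, and minimality; the term weights, the per-activation scaling factors, and their neutral value of 1 are assumptions, not the paper's implementation.

```python
# Toy three-term intervention objective: flip a correct/wrong token pair,
# keep all other logits near the baseline, and keep the scalers sparse.
import torch

def intervention_loss(logits, base_logits, scales,
                      correct_id, wrong_id,
                      lam_faith=1.0, lam_sparse=0.1):
    # effectiveness: minimizing this pushes the wrong token above the correct one
    eff = logits[correct_id] - logits[wrong_id]
    # faithfulness: all other logits should stay close to the baseline
    mask = torch.ones_like(logits, dtype=torch.bool)
    mask[[correct_id, wrong_id]] = False
    faith = (logits[mask] - base_logits[mask]).pow(2).mean()
    # minimality: most scaling factors should stay at their neutral value 1
    sparse = (scales - 1.0).abs().sum()
    return eff + lam_faith * faith + lam_sparse * sparse
```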
arXiv Detail & Related papers (2024-10-07T12:01:32Z)
- Uncovering Latent Chain of Thought Vectors in Language Models [2.6089354079273512]
We investigate the technique of steering vectors: biasing the forward pass of language models using a "steering vector" derived from a specific task.
We apply them to steer language models toward performing Chain of Thought (CoT) reasoning without the need for natural-language prompting.
We find that this approach yields consistent steering toward CoT responses and requires less compute than fine-tuning models for CoT.
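Applying a task-derived vector at inference time looks much like the extraction sketch above, except the vector is fixed and only generation runs; `cot_vector`, the layer index, and the GPT-2-style module path are assumptions.

```python
# Sketch: bias the forward pass with a fixed steering vector during decoding.
def steer_and_generate(model, tok, prompt, cot_vector, layer=6, **gen_kwargs):
    handle = model.transformer.h[layer].register_forward_hook(
        lambda mod, inp, out: (out[0] + cot_vector,) + out[1:]
    )
    try:
        ids = tok(prompt, return_tensors="pt").input_ids
        out_ids = model.generate(ids, **gen_kwargs)
    finally:
        handle.remove()  # always detach the hook, even on error
    return tok.decode(out_ids[0], skip_special_tokens=True)
```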
arXiv Detail & Related papers (2024-09-21T05:58:07Z)
- Improving Activation Steering in Language Models with Mean-Centring [10.101141087916133]
We find that taking the average of activations associated with a target dataset, and subtracting the mean of all training activations, results in effective steering vectors.
We also apply mean-centring to extract function vectors, which trigger the execution of a range of natural language tasks significantly more effectively.
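The recipe itself is a one-line difference of means once activations have been collected; this sketch assumes hidden states gathered at a single layer.

```python
# Mean-centring: steering vector = mean activation on the target dataset
# minus the mean activation over generic training text.
import torch

def mean_centred_vector(target_acts: torch.Tensor,
                        train_acts: torch.Tensor) -> torch.Tensor:
    # target_acts: (n_target, d) activations from the target dataset
    # train_acts:  (n_train, d) activations from generic training text
    return target_acts.mean(dim=0) - train_acts.mean(dim=0)
```

The resulting vector is then added to the hidden states at the same layer during generation, as in the injection sketches above.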
arXiv Detail & Related papers (2023-12-06T18:27:07Z)
- GanLM: Encoder-Decoder Pre-training with an Auxiliary Discriminator [114.8954615026781]
We propose a GAN-style model for encoder-decoder pre-training by introducing an auxiliary discriminator.
GanLM is trained with two pre-training objectives: replaced token detection and replaced token denoising.
Experiments on language generation benchmarks show that GanLM, with its strong language understanding capability, outperforms various strong pre-trained language models.
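The replaced token detection half of the objective can be sketched ELECTRA-style: a discriminator head predicts, per position, whether a token was swapped. The logit shapes and head are assumptions.

```python
# Toy replaced-token-detection loss: one binary label per position,
# 1 where the input token differs from the original token.
import torch
import torch.nn.functional as F

def rtd_loss(disc_logits: torch.Tensor,
             input_ids: torch.Tensor,
             original_ids: torch.Tensor) -> torch.Tensor:
    # disc_logits: (batch, seq) scores from the discriminator head
    labels = (input_ids != original_ids).float()
    return F.binary_cross_entropy_with_logits(disc_logits, labels)
```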
arXiv Detail & Related papers (2022-12-20T12:51:11Z)
- Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
- Internal language model estimation through explicit context vector learning for attention-based encoder-decoder ASR [19.233720469733797]
We propose two novel approaches to estimate the biased ILM based on Listen-Attend-Spell (LAS) models.
Experiments show that the ILMs estimated by the proposed methods achieve the lowest perplexity.
arXiv Detail & Related papers (2022-01-26T07:47:27Z)
- Sentence Bottleneck Autoencoders from Transformer Language Models [53.350633961266375]
We build a sentence-level autoencoder from a pretrained, frozen transformer language model.
We adapt the masked language modeling objective as a generative, denoising one, while only training a sentence bottleneck and a single-layer modified transformer decoder.
We demonstrate that the sentence representations discovered by our model achieve better quality than previous methods that extract representations from pretrained transformers on text similarity tasks, style transfer, and single-sentence classification tasks in the GLUE benchmark, while using fewer parameters than large pretrained models.
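A schematic sketch of the bottleneck component, assuming a frozen encoder whose token states are mean-pooled and projected; the dimensions and the pooling choice are illustrative, not the paper's architecture.

```python
# Sketch: compress pooled token states into a single bottleneck vector
# that conditions a small trainable decoder.
import torch.nn as nn

class SentenceBottleneck(nn.Module):
    def __init__(self, hidden: int = 768, bottleneck: int = 512):
        super().__init__()
        self.compress = nn.Linear(hidden, bottleneck)

    def forward(self, token_states, attention_mask):
        # mean-pool over non-padding tokens, then project
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (token_states * mask).sum(dim=1) / mask.sum(dim=1)
        return self.compress(pooled)
```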
arXiv Detail & Related papers (2021-08-31T19:39:55Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
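As we read the abstract, the decoding algorithm discourages verbatim copying; the sketch below implements one plausible blocking rule, which is an assumption rather than the paper's exact procedure.

```python
# Toy decoding-time constraint in the spirit of Dynamic Blocking: if the
# previously emitted token also occurs in the source, veto the source token
# that immediately follows it, nudging the decoder away from copying.
def block_copying(logits, source_ids, prev_token):
    for i, tok_id in enumerate(source_ids[:-1]):
        if tok_id == prev_token:
            logits[source_ids[i + 1]] = float("-inf")
    return logits
```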
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
- Exemplar-Controllable Paraphrasing and Translation using Bitext [57.92051459102902]
We adapt models from prior work so that they can learn solely from bilingual text (bitext).
Our single proposed model can perform four tasks: controlled paraphrase generation in both languages and controlled machine translation in both language directions.
arXiv Detail & Related papers (2020-10-12T17:02:50Z)
- Self-Supervised Contrastive Learning for Unsupervised Phoneme Segmentation [37.054709598792165]
The model is a convolutional neural network that operates directly on the raw waveform.
It is optimized to identify spectral changes in the signal using the Noise-Contrastive Estimation principle.
At test time, a peak detection algorithm is applied over the model outputs to produce the final boundaries.
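The boundary-extraction step can be sketched with an off-the-shelf peak detector over the frame-level scores; the prominence threshold is an illustrative assumption.

```python
# Sketch: turn frame-level change scores into phoneme boundary indices.
import numpy as np
from scipy.signal import find_peaks

def boundaries_from_scores(scores: np.ndarray, prominence: float = 0.1):
    peaks, _ = find_peaks(scores, prominence=prominence)
    return peaks  # frame indices of predicted boundaries
```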
arXiv Detail & Related papers (2020-07-27T12:10:21Z)