Steering Without Side Effects: Improving Post-Deployment Control of Language Models
- URL: http://arxiv.org/abs/2406.15518v1
- Date: Fri, 21 Jun 2024 01:37:39 GMT
- Title: Steering Without Side Effects: Improving Post-Deployment Control of Language Models
- Authors: Asa Cooper Stickland, Alexander Lyzhov, Jacob Pfau, Salsabila Mahdi, Samuel R. Bowman
- Abstract summary: Language models (LMs) have been shown to behave unexpectedly post-deployment.
We present KL-then-steer (KTS), a technique that decreases the side effects of steering while retaining its benefits.
Our best method prevents 44% of jailbreak attacks compared to the original Llama-2-chat-7B model.
- Score: 61.99293520621248
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models (LMs) have been shown to behave unexpectedly post-deployment. For example, new jailbreaks continually arise, allowing model misuse, despite extensive red-teaming and adversarial training by developers. Given that most model queries are unproblematic and frequent retraining results in an unstable user experience, methods for mitigating worst-case behavior should be targeted. One such method is classifying inputs as potentially problematic, then selectively applying steering vectors to these problematic inputs, i.e., adding particular vectors to the model's hidden states. However, steering vectors can also negatively affect model performance, which is an issue in cases where the classifier is incorrect. We present KL-then-steer (KTS), a technique that decreases the side effects of steering while retaining its benefits, by first training a model to minimize Kullback-Leibler (KL) divergence between a steered and unsteered model on benign inputs, then steering the model that has undergone this training. Our best method prevents 44% of jailbreak attacks compared to the original Llama-2-chat-7B model while maintaining helpfulness (as measured by MT-Bench) on benign requests almost on par with the original LM. To demonstrate the generality and transferability of our method beyond jailbreaks, we show that our KTS model can be steered to reduce bias towards user-suggested answers on TruthfulQA. Code is available at https://github.com/AsaCooperStickland/kl-then-steer.
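To make the abstract's two ingredients concrete, the sketch below shows (1) a steering vector added to one decoder layer's hidden states via a forward hook and (2) a KL-divergence objective that keeps the steered model close to a frozen, unsteered reference on benign inputs. This is a minimal illustration under assumed settings, not the authors' implementation (see the linked repository for that): the layer index, steering coefficient, random stand-in vector, and helper names add_steering_hook / kl_to_reference are assumptions made here for brevity.

```python
# Minimal sketch of steering-vector application and a KL-then-steer-style
# objective. Hyperparameters and helper names are illustrative assumptions;
# see https://github.com/AsaCooperStickland/kl-then-steer for the real code.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
reference = AutoModelForCausalLM.from_pretrained(model_name)  # frozen, unsteered copy
reference.requires_grad_(False)

LAYER, COEFF = 13, 4.0  # assumed layer index and steering strength
steer_vec = torch.randn(model.config.hidden_size)  # stand-in for a learned/contrastive vector

def add_steering_hook(module, vec, coeff):
    """Add coeff * vec to the residual-stream output of one decoder layer."""
    def hook(_, __, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coeff * vec.to(hidden.dtype).to(hidden.device)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return module.register_forward_hook(hook)

def kl_to_reference(batch):
    """KL divergence between the steered model and the frozen reference on a
    benign batch; minimizing this trains the model to tolerate the steering
    perturbation without changing its benign behavior."""
    handle = add_steering_hook(model.model.layers[LAYER], steer_vec, COEFF)
    logits = model(**batch).logits
    handle.remove()
    with torch.no_grad():
        ref_logits = reference(**batch).logits
    return F.kl_div(
        F.log_softmax(logits, dim=-1),
        F.log_softmax(ref_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )

# Training loop (not shown): minimize kl_to_reference on benign prompts.
# At inference, a classifier flags suspicious inputs and the same hook is
# applied only to those, steering the now steering-robust model.
```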
Related papers
- Steering Language Model Refusal with Sparse Autoencoders [16.78963326253821]
We identify and steer features in Phi-3 Mini that mediate refusal behavior.
We find that feature steering can improve Phi-3 Mini's robustness to jailbreak attempts across various harms.
However, feature steering can adversely affect overall performance on benchmarks.
arXiv Detail & Related papers (2024-11-18T05:47:02Z)
- Mitigating Covariate Shift in Imitation Learning for Autonomous Vehicles Using Latent Space Generative World Models [60.87795376541144]
A world model is a neural network capable of predicting an agent's next state given past states and actions.
During end-to-end training, our policy learns how to recover from errors by aligning with states observed in human demonstrations.
We present qualitative and quantitative results, demonstrating significant improvements upon prior state of the art in closed-loop testing.
arXiv Detail & Related papers (2024-09-25T06:48:25Z)
- Training Language Models to Self-Correct via Reinforcement Learning [98.35197671595343]
Self-correction has been found to be largely ineffective in modern large language models (LLMs).
We develop a multi-turn online reinforcement learning approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data.
We find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.
arXiv Detail & Related papers (2024-09-19T17:16:21Z)
- Representation Tuning [0.0]
Activation engineering is becoming increasingly popular as a means of online control of large language models.
In this work, we extend the idea of inference-time steering with vectors that represent a behavioral direction of interest to tuning those vectors directly into the model.
arXiv Detail & Related papers (2024-09-11T00:56:02Z)
- Refusal in Language Models Is Mediated by a Single Direction [4.532520427311685]
We show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size.
We propose a novel white-box jailbreak method that surgically disables refusal with minimal effect on other capabilities.
arXiv Detail & Related papers (2024-06-17T16:36:12Z)
- Manipulating and Mitigating Generative Model Biases without Retraining [49.60774626839712]
We propose a dynamic and computationally efficient manipulation of T2I model biases by exploiting their rich language embedding spaces without model retraining.
We show that leveraging foundational vector algebra allows for convenient control over language model embeddings to shift T2I model outputs.
As a by-product, this control serves as a form of precise prompt engineering to generate images which are generally implausible using regular text prompts.
arXiv Detail & Related papers (2024-04-03T07:33:30Z)
- From Hope to Safety: Unlearning Biases of Deep Models via Gradient Penalization in Latent Space [13.763716495058294]
Deep Neural Networks are prone to learning spurious correlations embedded in the training data, leading to potentially biased predictions.
This poses risks when deploying these models for high-stake decision-making, such as in medical applications.
We present a novel method for model correction on the concept level that explicitly reduces model sensitivity towards biases via gradient penalization.
arXiv Detail & Related papers (2023-08-18T10:07:46Z)
- Explain, Edit, and Understand: Rethinking User Study Design for Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z)
- Positive-Congruent Training: Towards Regression-Free Model Updates [87.25247195148187]
In image classification, sample-wise inconsistencies appear as "negative flips": the new model incorrectly predicts the output for a test sample that was correctly classified by the old (reference) model.
We propose a simple approach for positive-congruent (PC) training, Focal Distillation, which enforces congruence with the reference model.
arXiv Detail & Related papers (2020-11-18T09:00:44Z)
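To make the last entry's idea concrete, the sketch below gives one plausible reading of "enforcing congruence with the reference model": a distillation term that is up-weighted on samples the old (reference) model classified correctly, which discourages negative flips. The function name focal_distillation_loss, the weighting scheme, and the alpha/beta/temperature values are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of a focal-distillation-style loss for positive-congruent
# training: distill toward the old model, with extra weight on samples the old
# model already classified correctly (to suppress "negative flips").
import torch
import torch.nn.functional as F

def focal_distillation_loss(new_logits, old_logits, labels,
                            alpha=1.0, beta=5.0, temperature=1.0):
    """Cross-entropy on labels plus a per-sample-weighted distillation term.

    alpha: base weight applied to every sample's distillation term.
    beta:  extra weight added when the old (reference) model was correct.
    (Both values are illustrative, not taken from the paper.)
    """
    ce = F.cross_entropy(new_logits, labels)

    # Per-sample KL between old and new predictive distributions.
    log_p_new = F.log_softmax(new_logits / temperature, dim=-1)
    p_old = F.softmax(old_logits / temperature, dim=-1)
    kl = (p_old * (p_old.clamp_min(1e-8).log() - log_p_new)).sum(dim=-1)

    # Up-weight samples the reference model got right.
    old_correct = (old_logits.argmax(dim=-1) == labels).float()
    weights = alpha + beta * old_correct
    return ce + (weights * kl).mean()

# Usage: old_logits come from the frozen reference model on the same batch;
# the new model is trained with this loss instead of plain cross-entropy.
```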