One-shot Optimized Steering Vectors Mediate Safety-relevant Behaviors in LLMs
- URL: http://arxiv.org/abs/2502.18862v2
- Date: Tue, 12 Aug 2025 23:58:47 GMT
- Title: One-shot Optimized Steering Vectors Mediate Safety-relevant Behaviors in LLMs
- Authors: Jacob Dunefsky, Arman Cohan
- Abstract summary: We propose optimizing steering vectors through gradient descent on a single training example. We find that the resulting SVs effectively mediate safety-relevant behaviors in multiple models. We extend work on "emergent misalignment" and show that SVs optimized to induce a model to write vulnerable code cause the model to respond harmfully.
- Score: 21.2431937128876
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Steering vectors (SVs) have emerged as a promising approach for interpreting and controlling LLMs, but current methods typically require large contrastive datasets that are often impractical to construct and may capture spurious correlations. We propose directly optimizing SVs through gradient descent on a single training example, and systematically investigate how these SVs generalize. We consider several SV optimization techniques and find that the resulting SVs effectively mediate safety-relevant behaviors in multiple models. Indeed, in experiments on an alignment-faking model, we are able to optimize one-shot SVs that induce harmful behavior on benign examples and whose negations suppress harmful behavior on malign examples. And in experiments on refusal suppression, we demonstrate that one-shot optimized SVs can transfer across inputs, yielding a HarmBench attack success rate of 96.9%. Furthermore, we extend work on "emergent misalignment" and show that SVs optimized to induce a model to write vulnerable code cause the model to respond harmfully on unrelated open-ended prompts. Finally, we use one-shot SV optimization to investigate how an instruction-tuned LLM recovers from outputting false information, and find that this ability is independent of the model's explicit verbalization that the information was false. Overall, our findings suggest that optimizing SVs on a single example can mediate a wide array of misaligned behaviors in LLMs. Code can be found at https://github.com/jacobdunefsky/one-shot-steering-repro and https://github.com/jacobdunefsky/one-shot-steering-misalignment.
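As a concrete illustration of the core technique, the following is a minimal PyTorch sketch of one-shot SV optimization: a single vector added to one layer's residual stream is trained by gradient descent to make a single target completion likely. The model, layer index, training example, and hyperparameters are all illustrative placeholders; the authors' actual implementations are in the repositories linked above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch of one-shot steering-vector (SV) optimization.
# All choices below (model, layer, example, hyperparameters) are
# illustrative placeholders, not the paper's actual configuration.
model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.requires_grad_(False)  # only the SV is trained

sv = torch.zeros(model.config.hidden_size, requires_grad=True)

def add_sv(module, inputs, output):
    # Add the SV to the residual stream at every token position.
    hidden = output[0]
    return (hidden + sv,) + output[1:]

layer_idx = 6  # which layer to steer (assumption)
handle = model.transformer.h[layer_idx].register_forward_hook(add_sv)

prompt_ids = tok("How do I pick a lock?", return_tensors="pt").input_ids
target_ids = tok(" Sure, here's how:", return_tensors="pt").input_ids
input_ids = torch.cat([prompt_ids, target_ids], dim=1)

opt = torch.optim.Adam([sv], lr=1e-2)
for _ in range(200):
    logits = model(input_ids).logits
    # Positions P-1 .. P+T-2 predict the T target tokens.
    pred = logits[0, prompt_ids.shape[1] - 1 : -1]
    loss = torch.nn.functional.cross_entropy(pred, target_ids[0])
    opt.zero_grad()
    loss.backward()
    opt.step()

handle.remove()  # re-register the hook at inference time to steer
```

Negating or rescaling the optimized vector at inference time then gives the suppression direction used in the paper's alignment-faking experiments.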
Related papers
- One-shot Optimized Steering Vector for Hallucination Mitigation for VLMs [8.089908150148554]
Vision Language Models (VLMs) achieve strong performance on multimodal tasks but still suffer from hallucination and safety-related failures. We propose OSGA (One-shot Steering with Generative Anchor), an input-independent framework that improves model performance with a single optimization instance.
arXiv Detail & Related papers (2026-01-30T14:47:59Z) - Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach [78.4812458793128]
We propose TACO, a test-time-scaling framework that applies a lightweight pseudo-count estimator as a high-fidelity verifier of action chunks. Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL) and, being gradient-free, incurs significant computational benefits.
arXiv Detail & Related papers (2025-12-02T14:42:54Z) - Few-Shot Adaptation Benchmark for Remote Sensing Vision-Language Models [20.81142541450895]
We present the first structured benchmark for evaluating few-shot adaptation methods on RSVLMs. We conduct comprehensive experiments across ten remote sensing scene classification datasets. Our findings reveal that models with similar zero-shot performance can exhibit markedly different behavior under few-shot adaptation.
arXiv Detail & Related papers (2025-10-08T15:29:48Z) - Speculative Verification: Exploiting Information Gain to Refine Speculative Decoding [8.36763119650407]
Speculative Verification dynamically predicts speculation accuracy and adapts the verification length to maximize throughput. It improves SD performance by up to 2×, with an average speedup of 1.4× in large-batch settings.
arXiv Detail & Related papers (2025-09-29T06:25:54Z) - SelfAug: Mitigating Catastrophic Forgetting in Retrieval-Augmented Generation via Distribution Self-Alignment [49.86376148975563]
Large language models (LLMs) have revolutionized natural language processing through their capabilities in understanding and executing diverse tasks. However, supervised fine-tuning, particularly in Retrieval-Augmented Generation (RAG) scenarios, often leads to catastrophic forgetting. We propose SelfAug, a self-distribution alignment method that aligns input sequence logits to preserve the model's semantic distribution.
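One plausible reading of that alignment objective, sketched below: add a KL term that keeps the fine-tuned model's logits close to those of a frozen copy of the original model. This is inferred from the summary, not SelfAug's published loss, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of a self-distribution alignment loss in the spirit
# of SelfAug, inferred from the summary above (not the paper's exact loss).
def selfaug_loss(model, ref_model, input_ids, labels, alpha=0.1):
    logits = model(input_ids).logits
    with torch.no_grad():
        ref_logits = ref_model(input_ids).logits  # frozen original model
    # Standard SFT cross-entropy (assumes `labels` are pre-shifted
    # next-token targets with -100 on positions to ignore).
    sft = F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100
    )
    # KL between the reference and current token distributions,
    # discouraging drift from the model's original semantic distribution.
    kl = F.kl_div(
        F.log_softmax(logits, dim=-1),
        F.log_softmax(ref_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    return sft + alpha * kl
```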
arXiv Detail & Related papers (2025-09-04T06:50:47Z) - GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs [56.93583799109029]
GrAInS is an inference-time steering approach that operates across both language-only and vision-language models and tasks. During inference, GrAInS adjusts hidden activations at transformer layers guided by token-level attribution signals, and normalizes activations to preserve representational scale. It consistently outperforms both fine-tuning and existing steering baselines.
arXiv Detail & Related papers (2025-07-24T02:34:13Z) - SAND: Boosting LLM Agents with Self-Taught Action Deliberation [53.732649189709285]
Large Language Model (LLM) agents are commonly tuned with supervised finetuning on ReAct-style expert trajectories or preference optimization over pairwise rollouts. We propose the Self-taught ActioN Deliberation (SAND) framework, enabling LLM agents to explicitly deliberate over candidate actions before committing to one. SAND achieves an average 20% improvement over initial supervised finetuning and also outperforms state-of-the-art agent tuning approaches.
arXiv Detail & Related papers (2025-07-10T05:38:15Z) - ExpertSteer: Intervening in LLMs through Expert Knowledge [71.12193680015622]
Activation steering offers a promising method to control the generation process of Large Language Models. We propose ExpertSteer, a novel approach that leverages arbitrary specialized expert models to generate steering vectors. We conduct comprehensive experiments using three LLMs on 15 popular benchmarks across four distinct domains.
arXiv Detail & Related papers (2025-05-18T08:55:46Z) - Steering Risk Preferences in Large Language Models by Aligning Behavioral and Neural Representations [4.029252551781513]
We propose a principled approach for uncovering steering vectors. We focus on extracting latent risk preferences from large language models. We show that the resulting steering vectors successfully and reliably modulate LLM outputs in line with the targeted behavior.
arXiv Detail & Related papers (2025-05-16T18:23:10Z) - Patterns and Mechanisms of Contrastive Activation Engineering [0.374490703387131]
CAE has the potential to introduce a new paradigm of flexible, task-specific behavior tuning. We analyze the performance of CAE in in-distribution and out-of-distribution settings, evaluate its drawbacks, and begin to develop comprehensive guidelines for its effective deployment.
arXiv Detail & Related papers (2025-05-06T05:15:12Z) - Improving Reasoning Performance in Large Language Models via Representation Engineering [2.0099933815960256]
We propose a representation engineering approach for large language models (LLMs).
Model activations are read from the residual stream of an LLM when processing a reasoning task.
We show that an LLM can, to a certain degree, be controlled to improve its perceived reasoning ability by modulating activations.
arXiv Detail & Related papers (2025-04-28T04:58:43Z) - SEAL: Steerable Reasoning Calibration of Large Language Models for Free [58.190800043449336]
Large Language Models (LLMs) have demonstrated compelling capabilities for complex reasoning tasks via the extended chain-of-thought (CoT) reasoning mechanism.
Recent studies reveal substantial redundancy in the CoT reasoning traces, which negatively impacts model performance.
We introduce SEAL, a training-free approach that seamlessly calibrates the CoT process, improving accuracy while demonstrating significant efficiency gains.
arXiv Detail & Related papers (2025-04-07T02:42:07Z) - Shifting Perspectives: Steering Vector Ensembles for Robust Bias Mitigation in LLMs [8.91107152198979]
We present a novel approach to bias mitigation in large language models (LLMs) by applying steering vectors to modify model activations in forward passes.
We employ Bayesian optimization to systematically identify effective contrastive pair datasets across nine bias axes.
Building on these promising results, we introduce Steering Vector Ensembles (SVE), a method that averages multiple individually optimized steering vectors, each targeting a specific bias axis such as age, race, or gender.
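The ensembling step itself reduces to an average, as in this minimal sketch (dimensions and axis names are illustrative; in practice each vector would come from its own contrastive-pair optimization):

```python
import torch

# Minimal sketch of Steering Vector Ensembles (SVE): average individually
# optimized per-axis steering vectors into a single vector. The vectors
# below are random stand-ins for optimized ones.
d_model = 768
per_axis = {
    "age": torch.randn(d_model),
    "race": torch.randn(d_model),
    "gender": torch.randn(d_model),
}
sve = torch.stack(list(per_axis.values())).mean(dim=0)
# `sve` is then added to activations in the forward pass,
# exactly as a single steering vector would be.
```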
arXiv Detail & Related papers (2025-03-07T12:25:29Z) - Dobi-SVD: Differentiable SVD for LLM Compression and Some New Perspectives [59.46211685419206]
We argue that the optimal use of SVD lies in truncating activations, rather than merely using activations as an optimization distance. We propose Dobi-SVD, which establishes a new, principled approach to SVD-based LLM compression.
arXiv Detail & Related papers (2025-02-04T21:17:51Z) - AdaSVD: Adaptive Singular Value Decomposition for Large Language Models [75.1196637934987]
Singular Value Decomposition (SVD) has emerged as a promising compression technique for large language models (LLMs). Existing SVD-based methods often struggle to effectively mitigate the errors introduced by SVD truncation. We propose AdaSVD, an adaptive SVD-based LLM compression approach.
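Both entries refine the same baseline: factor each weight matrix with a truncated SVD and keep only the top singular directions. A generic sketch of that baseline follows (this is the starting point these papers improve on, not their adaptive or differentiable schemes; shapes are illustrative):

```python
import torch

# Plain truncated-SVD compression of one weight matrix: the baseline
# that AdaSVD and Dobi-SVD build on (not their adaptive schemes).
def svd_compress(weight: torch.Tensor, k: int):
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    A = U[:, :k] * S[:k]  # (out, k): columns scaled by singular values
    B = Vh[:k, :]         # (k, in)
    return A, B           # W is replaced by the low-rank product A @ B

W = torch.randn(4096, 11008)  # e.g., an MLP projection (illustrative)
A, B = svd_compress(W, k=512)
rel_err = (W - A @ B).norm() / W.norm()  # the truncation error these papers target
```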
arXiv Detail & Related papers (2025-02-03T14:34:37Z) - Debias your Large Multi-Modal Model at Test-Time via Non-Contrastive Visual Attribute Steering [7.471995248769638]
We propose a training-free debiasing framework for large Multi-Modal Models (LMMs).
Our framework intervenes on the model's representations during text generation by constructing a steering vector that reduces references to protected attributes.
Our experiments show that these interventions effectively reduce the propensity of LMMs to generate text related to protected attributes while maintaining sentiment and fluency.
arXiv Detail & Related papers (2024-11-15T20:06:09Z) - Refusal in LLMs is an Affine Function [1.722461331472526]
We propose affine concept editing (ACE) as an approach for steering language models' behavior. ACE combines affine subspace projection and activation addition to reliably control the model's refusal responses. Our experiments demonstrate that ACE consistently achieves more precise control over model behavior than existing methods.
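A sketch of that affine edit, assuming a refusal direction `r` and reference point `mu` estimated from model activations (both random stand-ins here), with `alpha` setting how much of the refusal component to add back:

```python
import torch

# Sketch of affine concept editing (ACE): remove the activation's component
# along the refusal direction, measured relative to a reference point, then
# add back a controlled amount. `r` and `mu` stand in for quantities
# estimated from model activations.
def ace_edit(h, r, mu, alpha):
    r = r / r.norm()
    coeff = (h - mu) @ r                     # signed refusal strength
    h_ablated = h - coeff.unsqueeze(-1) * r  # affine subspace projection
    return h_ablated + alpha * r             # activation addition

h = torch.randn(4, 768)   # residual-stream activations (illustrative)
r = torch.randn(768)      # refusal direction
mu = torch.zeros(768)     # reference point
steered = ace_edit(h, r, mu, alpha=2.0)
```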
arXiv Detail & Related papers (2024-11-13T20:12:55Z) - Steering Without Side Effects: Improving Post-Deployment Control of Language Models [61.99293520621248]
Language models (LMs) have been shown to behave unexpectedly post-deployment.
We present KL-then-steer (KTS), a technique that decreases the side effects of steering while retaining its benefits.
Our best method prevents 44% of jailbreak attacks compared to the original Llama-2-chat-7B model.
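Schematically, KTS first fine-tunes the model so that its steered outputs stay close in KL to the original model on benign inputs, then applies the steering vector at deployment. The sketch below is a reading of the summary, not the paper's training code; `steered` is a hypothetical context manager that adds the vector during the forward pass.

```python
import torch
import torch.nn.functional as F

# Conceptual sketch of the KL-then-steer (KTS) fine-tuning objective,
# inferred from the summary above. `steered` is a hypothetical context
# manager that adds a steering vector to the model's activations.
def kts_loss(model, ref_model, benign_ids, steered):
    with steered(model):
        steered_logits = model(benign_ids).logits
    with torch.no_grad():
        ref_logits = ref_model(benign_ids).logits  # frozen original model
    # Penalize divergence of the steered model from the original on
    # benign inputs, training away steering's side effects.
    return F.kl_div(
        F.log_softmax(steered_logits, dim=-1),
        F.log_softmax(ref_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )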
arXiv Detail & Related papers (2024-06-21T01:37:39Z) - Aligning Language Models with Demonstrated Feedback [58.834937450242975]
Demonstration ITerated Task Optimization (DITTO) directly aligns language model outputs to a user's demonstrated behaviors.
We evaluate DITTO's ability to learn fine-grained style and task alignment across domains such as news articles, emails, and blog posts.
arXiv Detail & Related papers (2024-06-02T23:13:56Z) - Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization [34.05163996072159]
"steering vectors" are extracted from the activations of human preference data.
This work proposes an innovative approach that could produce more effective steering vectors through bi-directional preference optimization.
Our method is designed to allow steering vectors to directly influence the generation probability of contrastive human preference data pairs.
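One way to read that objective, as a sketch: push the vector `+v` to favor the preferred response and `-v` to favor the dispreferred one. The helper `seq_logp` is an assumption, standing for the response log-probability with `v` added at a chosen layer; this is inferred from the summary, not the paper's code.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of a bi-directional preference objective for a
# steering vector `v`, inferred from the summary above. `seq_logp` is an
# assumed helper returning the response log-probability when `v` is added
# to the model's activations at some layer.
def bipo_loss(seq_logp, prompt, chosen, rejected, v, beta=1.0):
    # With +v, the preferred response should become more likely...
    fwd = seq_logp(prompt, chosen, +v) - seq_logp(prompt, rejected, +v)
    # ...and with -v, the dispreferred response should win instead.
    rev = seq_logp(prompt, rejected, -v) - seq_logp(prompt, chosen, -v)
    return -F.logsigmoid(beta * fwd) - F.logsigmoid(beta * rev)
```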
arXiv Detail & Related papers (2024-05-28T05:10:40Z) - Are Large Language Models Good Prompt Optimizers? [65.48910201816223]
We conduct a study to uncover the actual mechanism of LLM-based Prompt Optimization.
Our findings reveal that the LLMs struggle to identify the true causes of errors during reflection, tending to be biased by their own prior knowledge.
We introduce a new "Automatic Behavior Optimization" paradigm, which directly optimizes the target model's behavior in a more controllable manner.
arXiv Detail & Related papers (2024-02-03T09:48:54Z) - InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance [56.184255657175335]
We develop InferAligner, a novel inference-time alignment method that utilizes cross-model guidance for harmlessness alignment.
Experimental results show that our method can be very effectively applied to domain-specific models in finance, medicine, and mathematics.
It significantly diminishes the Attack Success Rate (ASR) of both harmful instructions and jailbreak attacks, while maintaining almost unchanged performance in downstream tasks.
arXiv Detail & Related papers (2024-01-20T10:41:03Z) - Amortizing intractable inference in large language models [56.92471123778389]
We use amortized Bayesian inference to sample from intractable posterior distributions.
We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training.
As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem.
arXiv Detail & Related papers (2023-10-06T16:36:08Z) - Corruption-tolerant Algorithms for Generalized Linear Models [4.127284659744835]
SVAM (Sequential Variance-Altered MLE) is a unified framework for learning generalized linear models under adversarial label corruption.
SVAM is based on a novel variance reduction technique that may be of independent interest.
arXiv Detail & Related papers (2022-12-11T07:08:02Z) - Relational Reasoning via Set Transformers: Provable Efficiency and Applications to MARL [154.13105285663656]
A cooperative Multi-Agent Reinforcement Learning (MARL) framework with permutation-invariant agents has achieved tremendous empirical success in real-world applications.
Unfortunately, the theoretical understanding of this MARL problem is lacking due to the curse of many agents and the limited exploration of relational reasoning in existing works.
We prove that the suboptimality gaps of the model-free and model-based algorithms are independent of and logarithmic in the number of agents respectively, which mitigates the curse of many agents.
arXiv Detail & Related papers (2022-09-20T16:42:59Z) - Counterfactual Maximum Likelihood Estimation for Training Deep Networks [83.44219640437657]
Deep learning models are prone to learning spurious correlations that should not be learned as predictive clues.
We propose a causality-based training framework to reduce the spurious correlations caused by observable confounders.
We conduct experiments on two real-world tasks: Natural Language Inference (NLI) and Image Captioning.
arXiv Detail & Related papers (2021-06-07T17:47:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.