HSI: Head-Specific Intervention Can Induce Misaligned AI Coordination in Large Language Models
- URL: http://arxiv.org/abs/2502.05945v2
- Date: Thu, 01 May 2025 09:03:35 GMT
- Title: HSI: Head-Specific Intervention Can Induce Misaligned AI Coordination in Large Language Models
- Authors: Paul Darm, Annalisa Riccardi
- Abstract summary: We show that inference-time activation interventions can bypass safety alignments and effectively steer model generations towards harmful AI coordination for Llama 2. Our method applies fine-grained interventions at specific model sub-components, particularly attention heads, using a simple binary choice probing strategy. We show that probing single attention heads is more effective than intervening on full layers and that intervening on only four attention heads is comparable to supervised fine-tuning.
- Score: 2.6703221234079946
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Robust alignment guardrails for large language models are becoming increasingly important with their widespread application. In contrast to previous studies, we demonstrate that inference-time activation interventions can bypass safety alignments and effectively steer model generations towards harmful AI coordination for Llama 2. Our method applies fine-grained interventions at specific model sub-components, particularly attention heads, using a simple binary choice probing strategy. These interventions then generalise to the open-ended generation setting, effectively circumventing safety guardrails. We show that probing single attention heads is more effective than intervening on full layers and that intervening on only four attention heads is comparable to supervised fine-tuning. We further show that only a few example completions are needed to compute effective steering directions, which is an advantage over classical fine-tuning. Our findings highlight the shortcomings of current alignment techniques. In addition, our results suggest that, at the attention head level, activations encode fine-grained, linearly separable behaviours. Practically, the approach offers a straightforward methodology for steering large language model behaviour, which could be extended to diverse domains beyond safety that require fine-grained control over the model output. The code and datasets for this study can be found at https://github.com/PaulDrm/targeted_intervention.
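The repository linked above contains the authors' implementation. As a rough illustration of the mechanics the abstract describes, the sketch below computes a difference-of-means steering direction for a single attention head from a handful of paired completions and applies it at generation time through a forward pre-hook on that head's slice of the attention output projection. This is a minimal sketch only: the model name, layer and head indices, steering strength, and probe prompts are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch (not the authors' exact code) of a head-specific activation
# intervention. Layer/head indices, ALPHA, and the probe prompts are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # assumed target model
LAYER, HEAD = 14, 21                          # hypothetical head to steer
ALPHA = 8.0                                   # hypothetical steering strength

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

head_dim = model.config.hidden_size // model.config.num_attention_heads
sl = slice(HEAD * head_dim, (HEAD + 1) * head_dim)  # this head's slice of the o_proj input

def head_activation(prompt: str) -> torch.Tensor:
    """Return the chosen head's output at the final token position."""
    captured = {}
    def grab(module, inputs):
        captured["act"] = inputs[0][0, -1, sl].detach().float()
    handle = model.model.layers[LAYER].self_attn.o_proj.register_forward_pre_hook(grab)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt").to(model.device))
    handle.remove()
    return captured["act"]

# Binary-choice probe set: paired completions (illustrative placeholders).
pos_prompts = ["... completion exhibiting the target behaviour ..."]
neg_prompts = ["... completion exhibiting the refusal behaviour ..."]
direction = (torch.stack([head_activation(p) for p in pos_prompts]).mean(0)
             - torch.stack([head_activation(p) for p in neg_prompts]).mean(0))
direction = direction / direction.norm()

# At generation time, shift only this head's contribution along the direction.
def steer(module, inputs):
    hidden = inputs[0]
    hidden[..., sl] += ALPHA * direction.to(hidden.dtype).to(hidden.device)
    return (hidden,) + inputs[1:]

hook = model.model.layers[LAYER].self_attn.o_proj.register_forward_pre_hook(steer)
out = model.generate(**tok("User prompt here", return_tensors="pt").to(model.device),
                     max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
hook.remove()
```

Hooking the input of `o_proj` rather than a whole layer's output is what makes the edit head-specific: each head occupies a contiguous slice of that input, so only the chosen head's contribution is shifted while the rest of the layer is left untouched.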
Related papers
- Learning Distribution-Wise Control in Representation Space for Language Models [7.756342860929851]
Learnable interventions aim to apply pointwise control within the concept subspace and have proven effective in altering high-level behaviors. We extend this approach to the distribution level, enabling the model to learn not only pointwise transformations but also the surrounding regions of the concept subspace.
arXiv Detail & Related papers (2025-06-07T06:52:58Z) - Activation Space Interventions Can Be Transferred Between Large Language Models [0.0]
We show that safety interventions can be transferred between models through learned mappings of their shared activation spaces.
We demonstrate this approach on two well-established AI safety tasks: backdoor removal and refusal of harmful prompts.
We also propose a new task, *corrupted capabilities*, where models are fine-tuned to embed knowledge tied to a backdoor.
arXiv Detail & Related papers (2025-03-06T13:38:44Z) - Long-Sequence Recommendation Models Need Decoupled Embeddings [49.410906935283585]
We identify and characterize a neglected deficiency in existing long-sequence recommendation models.
A single set of embeddings struggles with learning both attention and representation, leading to interference between these two processes.
We propose the Decoupled Attention and Representation Embeddings (DARE) model, where two distinct embedding tables are learned separately to fully decouple attention and representation.
arXiv Detail & Related papers (2024-10-03T15:45:15Z) - Causality-Aware Transformer Networks for Robotic Navigation [13.719643934968367]
Current research in Visual Navigation reveals opportunities for improvement.
Direct adoption of RNNs and Transformers often overlooks the specific differences between Embodied AI and traditional sequential data modelling.
We propose Causality-Aware Transformer (CAT) Networks for Navigation, featuring a Causal Understanding Module.
arXiv Detail & Related papers (2024-09-04T12:53:26Z) - FinePseudo: Improving Pseudo-Labelling through Temporal-Alignablity for Semi-Supervised Fine-Grained Action Recognition [57.17966905865054]
Real-life applications of action recognition often require a fine-grained understanding of subtle movements.
Existing semi-supervised action recognition has mainly focused on coarse-grained action recognition.
We propose an Alignability-Verification-based Metric learning technique to effectively discriminate between fine-grained action pairs.
arXiv Detail & Related papers (2024-09-02T20:08:06Z) - A Unifying Framework for Action-Conditional Self-Predictive Reinforcement Learning [48.59516337905877]
Learning a good representation is a crucial challenge for Reinforcement Learning (RL) agents.
Recent work has developed theoretical insights into these algorithms.
We take a step towards bridging the gap between theory and practice by analyzing an action-conditional self-predictive objective.
arXiv Detail & Related papers (2024-06-04T07:22:12Z) - One-Shot Safety Alignment for Large Language Models via Optimal Dualization [64.52223677468861]
This paper presents a perspective of dualization that reduces constrained alignment to an equivalent unconstrained alignment problem.
We do so by pre-optimizing a smooth and convex dual function that has a closed form.
Our strategy leads to two practical algorithms in model-based and preference-based settings.
arXiv Detail & Related papers (2024-05-29T22:12:52Z) - Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization [34.05163996072159]
"steering vectors" are extracted from the activations of human preference data.
This work proposes an innovative approach that could produce more effective steering vectors through bi-directional preference optimization.
Our method is designed to allow steering vectors to directly influence the generation probability of contrastive human preference data pairs.
arXiv Detail & Related papers (2024-05-28T05:10:40Z) - Inverse-RLignment: Large Language Model Alignment from Demonstrations through Inverse Reinforcement Learning [62.05713042908654]
We introduce Alignment from Demonstrations (AfD), a novel approach leveraging high-quality demonstration data to overcome these challenges.
We formalize AfD within a sequential decision-making framework, highlighting its unique challenge of missing reward signals.
Practically, we propose a computationally efficient algorithm that extrapolates over a tailored reward model for AfD.
arXiv Detail & Related papers (2024-05-24T15:13:53Z) - Enhancing Interaction Modeling with Agent Selection and Physical Coefficient for Trajectory Prediction [1.6954753390775528]
We present ASPILin, which manually selects interacting agents and calculates their correlations instead of attention scores.
Remarkably, experiments conducted on the INTERACTION, highD, and CitySim datasets demonstrate that our method is efficient and straightforward.
arXiv Detail & Related papers (2024-05-21T18:45:18Z) - AI, Pluralism, and (Social) Compensation [1.5442389863546546]
A strategy in response to pluralistic values in a user population is to personalize an AI system.
If the AI can adapt to the specific values of each individual, then we can potentially avoid many of the challenges of pluralism.
However, if there is an external measure of success for the human-AI team, then the adaptive AI system may develop strategies to compensate for its human teammate.
arXiv Detail & Related papers (2024-04-30T04:41:47Z) - InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance [56.184255657175335]
We develop **InferAligner**, a novel inference-time alignment method that utilizes cross-model guidance for harmlessness alignment.
Experimental results show that our method can be very effectively applied to domain-specific models in finance, medicine, and mathematics.
It significantly diminishes the Attack Success Rate (ASR) of both harmful instructions and jailbreak attacks, while maintaining almost unchanged performance in downstream tasks.
arXiv Detail & Related papers (2024-01-20T10:41:03Z) - Interactive Autonomous Navigation with Internal State Inference and Interactivity Estimation [58.21683603243387]
We propose three auxiliary tasks with relational-temporal reasoning and integrate them into the standard Deep Learning framework.
These auxiliary tasks provide additional supervision signals to infer the behavior patterns of other interactive agents.
Our approach achieves robust and state-of-the-art performance in terms of standard evaluation metrics.
arXiv Detail & Related papers (2023-11-27T18:57:42Z) - Beyond Imitation: Leveraging Fine-grained Quality Signals for Alignment [105.34140537748546]
We propose an improved alignment approach named FIGA. Different from prior methods, we incorporate fine-grained quality signals that are derived by contrasting good and bad responses.
Our approach has made two major contributions. Firstly, we curate a refined alignment dataset that pairs initial responses and the corresponding revised ones.
Secondly, we devise a new loss function that can leverage fine-grained quality signals to instruct the learning of LLMs for alignment.
arXiv Detail & Related papers (2023-11-07T15:36:40Z) - Domain Knowledge Driven Pseudo Labels for Interpretable Goal-Conditioned Interactive Trajectory Prediction [29.701029725302586]
We study the joint trajectory prediction problem with the goal-conditioned framework.
We introduce a conditional-variational-autoencoder-based (CVAE) model to explicitly encode different interaction modes into the latent space.
We propose a novel approach to avoid KL vanishing and induce an interpretable interactive latent space with pseudo labels.
arXiv Detail & Related papers (2022-03-28T21:41:21Z) - Behavior Priors for Efficient Reinforcement Learning [97.81587970962232]
We consider how information and architectural constraints can be combined with ideas from the probabilistic modeling literature to learn behavior priors.
We discuss how such latent variable formulations connect to related work on hierarchical reinforcement learning (HRL) and mutual information and curiosity based objectives.
We demonstrate the effectiveness of our framework by applying it to a range of simulated continuous control domains.
arXiv Detail & Related papers (2020-10-27T13:17:18Z) - End-to-End Learning and Intervention in Games [60.41921763076017]
We provide a unified framework for learning and intervention in games.
We propose two approaches, respectively based on explicit and implicit differentiation.
The analytical results are validated using several real-world problems.
arXiv Detail & Related papers (2020-10-26T18:39:32Z) - Goal-Aware Prediction: Learning to Model What Matters [105.43098326577434]
One of the fundamental challenges in using a learned forward dynamics model is the mismatch between the objective of the learned model and that of the downstream planner or policy.
We propose to direct prediction towards task relevant information, enabling the model to be aware of the current task and encouraging it to only model relevant quantities of the state space.
We find that our method more effectively models the relevant parts of the scene conditioned on the goal, and as a result outperforms standard task-agnostic dynamics models and model-free reinforcement learning.
arXiv Detail & Related papers (2020-07-14T16:42:59Z) - Cost-effective Interactive Attention Learning with Neural Attention Processes [79.8115563067513]
We propose a novel interactive learning framework, which we refer to as Interactive Attention Learning (IAL).
IAL is prone to overfitting due to the scarcity of human annotations and requires costly retraining.
We tackle these challenges by proposing a sample-efficient attention mechanism and a cost-effective reranking algorithm for instances and features.
arXiv Detail & Related papers (2020-06-09T17:36:41Z)