Activation Space Interventions Can Be Transferred Between Large Language Models
- URL: http://arxiv.org/abs/2503.04429v1
- Date: Thu, 06 Mar 2025 13:38:44 GMT
- Title: Activation Space Interventions Can Be Transferred Between Large Language Models
- Authors: Narmeen Oozeer, Dhruv Nathawani, Nirmalendu Prakash, Michael Lan, Abir Harrasse, Amirali Abdullah
- Abstract summary: We show that safety interventions can be transferred between models through learned mappings of their shared activation spaces. We demonstrate this approach on two well-established AI safety tasks: backdoor removal and refusal of harmful prompts. We also propose a new task, *corrupted capabilities*, where models are fine-tuned to embed knowledge tied to a backdoor.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The study of representation universality in AI models reveals growing convergence across domains, modalities, and architectures. However, the practical applications of representation universality remain largely unexplored. We bridge this gap by demonstrating that safety interventions can be transferred between models through learned mappings of their shared activation spaces. We demonstrate this approach on two well-established AI safety tasks: backdoor removal and refusal of harmful prompts, showing successful transfer of steering vectors that alter the models' outputs in a predictable way. Additionally, we propose a new task, *corrupted capabilities*, where models are fine-tuned to embed knowledge tied to a backdoor. This tests their ability to separate useful skills from backdoors, reflecting real-world challenges. Extensive experiments across Llama, Qwen and Gemma model families show that our method enables using smaller models to efficiently align larger ones. Furthermore, we demonstrate that autoencoder mappings between base and fine-tuned models can serve as reliable "lightweight safety switches", allowing dynamic toggling between model behaviors.
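To make the transfer mechanism concrete, here is a minimal PyTorch sketch of the core idea: fit a mapper between two models' activation spaces from paired activations on the same prompts, then push a steering vector through it. The hidden sizes, mapper architecture, training objective, and all tensors below are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical hidden sizes for a small source model and a larger target
# model, plus synthetic stand-ins for activations collected at a chosen
# layer of each model on the SAME prompts (paired data).
d_src, d_tgt, n_pairs = 512, 1024, 4096
acts_src = torch.randn(n_pairs, d_src)  # source-model activations
acts_tgt = torch.randn(n_pairs, d_tgt)  # target-model activations

# An autoencoder-style mapper from the source space to the target space.
mapper = nn.Sequential(
    nn.Linear(d_src, d_tgt),
    nn.ReLU(),
    nn.Linear(d_tgt, d_tgt),
)
opt = torch.optim.Adam(mapper.parameters(), lr=1e-3)

# Fit the mapping by reconstructing target activations from source ones.
for _ in range(200):
    loss = nn.functional.mse_loss(mapper(acts_src), acts_tgt)
    opt.zero_grad()
    loss.backward()
    opt.step()

# A steering vector found in the source model (e.g. a refusal direction).
steer_src = torch.randn(d_src)

# Transfer it as a difference of mapped states, since a nonlinear mapper
# need not commute with vector addition.
with torch.no_grad():
    mean_act = acts_src.mean(0, keepdim=True)
    steer_tgt = (mapper(mean_act + steer_src) - mapper(mean_act)).squeeze(0)

# steer_tgt would then be added to the target model's activations at the
# matching layer (e.g. via a forward hook) to reproduce the intervention.
print(steer_tgt.shape)  # torch.Size([1024])
```

The same kind of mapper, trained between a base model and a fine-tuned variant, is what the abstract calls a "lightweight safety switch": at inference, activations can be routed through the mapper or left untouched, toggling between the two behaviors.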
Related papers
- Perspectives for Direct Interpretability in Multi-Agent Deep Reinforcement Learning [0.41783829807634765]
Multi-Agent Deep Reinforcement Learning (MADRL) has proven effective at solving complex problems in robotics and games. This paper advocates for direct interpretability, generating post hoc explanations directly from trained models. We explore modern methods, including relevance backpropagation, knowledge editing, model steering, activation patching, sparse autoencoders and circuit discovery.
arXiv Detail & Related papers (2025-02-02T09:15:27Z)
- On the Adversarial Transferability of Generalized "Skip Connections" [83.71752155227888]
Skip connections are an essential ingredient for making modern deep models deeper and more powerful.
We find that using more gradients from the skip connections rather than the residual modules during backpropagation allows one to craft adversarial examples with high transferability.
We conduct comprehensive transfer attacks against various models including ResNets, Transformers, Inceptions, Neural Architecture Search, and Large Language Models.
arXiv Detail & Related papers (2024-10-11T16:17:47Z)
- Prompt-Driven Contrastive Learning for Transferable Adversarial Attacks [42.18755809782401]
We propose a novel transfer attack method called PDCL-Attack.
We formulate an effective prompt-driven feature guidance by harnessing the semantic representation power of text.
arXiv Detail & Related papers (2024-07-30T08:52:16Z)
- MergeNet: Knowledge Migration across Heterogeneous Models, Tasks, and Modalities [72.05167902805405]
We present MergeNet, which learns to bridge the gap between the parameter spaces of heterogeneous models. The core mechanism of MergeNet lies in the parameter adapter, which operates by querying the source model's low-rank parameters. MergeNet is learned alongside both models, allowing our framework to dynamically transfer and adapt knowledge relevant to the current stage.
arXiv Detail & Related papers (2024-04-20T08:34:39Z)
- LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models [50.259006481656094]
We present a novel interactive application aimed towards understanding the internal mechanisms of large vision-language models.
Our interface is designed to enhance the interpretability of the image patches, which are instrumental in generating an answer.
We present a case study of how our application can aid in understanding failure mechanisms in a popular large multi-modal model: LLaVA.
arXiv Detail & Related papers (2024-04-03T23:57:34Z)
- Sim-to-Real Causal Transfer: A Metric Learning Approach to Causally-Aware Interaction Representations [62.48505112245388]
We take an in-depth look at the causal awareness of modern representations of agent interactions.
We show that recent representations are already partially resilient to perturbations of non-causal agents.
We propose a metric learning approach that regularizes latent representations with causal annotations.
arXiv Detail & Related papers (2023-12-07T18:57:03Z)
- Self-Supervised Reinforcement Learning that Transfers using Random Features [41.00256493388967]
We propose a self-supervised reinforcement learning method that enables the transfer of behaviors across tasks with different rewards.
Our method is self-supervised in that it can be trained on offline datasets without reward labels, but can then be quickly deployed on new tasks.
arXiv Detail & Related papers (2023-05-26T20:37:06Z)
- Enhancing Multiple Reliability Measures via Nuisance-extended Information Bottleneck [77.37409441129995]
In practical scenarios where training data is limited, many predictive signals in the data can stem from biases in data acquisition.
We consider an adversarial threat model under a mutual information constraint to cover a wider class of perturbations in training.
We propose an autoencoder-based training to implement the objective, as well as practical encoder designs to facilitate the proposed hybrid discriminative-generative training.
arXiv Detail & Related papers (2023-03-24T16:03:21Z)
- SafeAMC: Adversarial training for robust modulation recognition models [53.391095789289736]
In communication systems, many tasks, such as modulation recognition, rely on Deep Neural Network (DNN) models.
These models have been shown to be susceptible to adversarial perturbations, namely imperceptible additive noise crafted to induce misclassification.
We propose to use adversarial training, which consists of fine-tuning the model with adversarial perturbations, to increase the robustness of automatic modulation recognition models (a sketch of this procedure follows below).
arXiv Detail & Related papers (2021-05-28T11:29:04Z)
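As a concrete illustration of the entry above, here is a minimal PyTorch sketch of adversarial fine-tuning with a PGD-style attack. The tiny classifier, the synthetic I/Q-shaped inputs, the number of modulation classes, and the perturbation budget are placeholder assumptions, not the paper's actual architecture or settings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Placeholder classifier over 128-sample I/Q windows with 11 classes.
model = nn.Sequential(
    nn.Flatten(), nn.Linear(2 * 128, 64), nn.ReLU(), nn.Linear(64, 11)
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def pgd_attack(x, y, eps=0.05, alpha=0.01, steps=5):
    """Craft a small additive perturbation that raises the model's loss."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(model(x + delta), y)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()  # ascend the loss
            delta.clamp_(-eps, eps)             # keep the perturbation small
        delta.grad.zero_()
    return delta.detach()

# Adversarial training: fine-tune on perturbed examples.
for _ in range(100):
    x = torch.randn(32, 2, 128)        # stand-in for I/Q signal batches
    y = torch.randint(0, 11, (32,))    # stand-in modulation labels
    x_adv = x + pgd_attack(x, y)
    loss = loss_fn(model(x_adv), y)
    opt.zero_grad()                     # also clears grads left by the attack
    loss.backward()
    opt.step()
```

In practice the clean and adversarial losses are often mixed, and the perturbation budget is tuned so the noise stays imperceptible relative to the signal power.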