Universal Refusal Circuits Across LLMs: Cross-Model Transfer via Trajectory Replay and Concept-Basis Reconstruction
- URL: http://arxiv.org/abs/2601.16034v2
- Date: Sun, 25 Jan 2026 22:07:21 GMT
- Title: Universal Refusal Circuits Across LLMs: Cross-Model Transfer via Trajectory Replay and Concept-Basis Reconstruction
- Authors: Tony Cristofano
- Abstract summary: We introduce Trajectory Replay via Concept-Basis Reconstruction, a framework that transfers refusal interventions from donor to target models. By aligning layers via concept fingerprints and reconstructing refusal directions using a shared "recipe" of concept atoms, we map the donor's ablation trajectory into the target's semantic space. Our evaluation confirms that these transferred recipes consistently attenuate refusal while maintaining performance.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Refusal behavior in aligned LLMs is often viewed as model-specific, yet we hypothesize it stems from a universal, low-dimensional semantic circuit shared across models. To test this, we introduce Trajectory Replay via Concept-Basis Reconstruction, a framework that transfers refusal interventions from donor to target models, spanning diverse architectures (e.g., Dense to MoE) and training regimes, without using target-side refusal supervision. By aligning layers via concept fingerprints and reconstructing refusal directions using a shared "recipe" of concept atoms, we map the donor's ablation trajectory into the target's semantic space. To preserve capabilities, we introduce a weight-SVD stability guard that projects interventions away from high-variance weight subspaces to prevent collateral damage. Our evaluation across 8 model pairs confirms that these transferred recipes consistently attenuate refusal while maintaining performance, providing strong evidence for the semantic universality of safety alignment.
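The abstract outlines a three-stage pipeline: align donor and target layers via concept fingerprints, reconstruct the donor's refusal direction in the target from a shared recipe over concept atoms, and guard the resulting intervention with a weight-SVD projection. As a reading aid, here is a minimal NumPy sketch of that pipeline; it assumes mean layer activations over a shared concept-prompt bank are available for both models, and every function name, shape convention, and hyperparameter below is an illustrative assumption, not the authors' implementation.

```python
# Minimal sketch of the transfer pipeline described in the abstract.
# Assumed inputs: per-layer mean activations on a shared concept bank
# for donor and target. All names and hyperparameters are hypothetical.
import numpy as np

def concept_fingerprint(acts):
    """acts: (n_concepts, d) mean activations at one layer on the shared
    concept bank. The fingerprint is the concept-concept cosine-similarity
    matrix, comparable across models of different hidden width."""
    a = acts / np.linalg.norm(acts, axis=1, keepdims=True)
    return a @ a.T

def align_layers(donor_acts, target_acts):
    """Map each donor layer to the target layer with the closest
    fingerprint (Frobenius distance); this gives the correspondence
    along which the donor's ablation trajectory is replayed."""
    d_fp = {l: concept_fingerprint(a) for l, a in donor_acts.items()}
    t_fp = {l: concept_fingerprint(a) for l, a in target_acts.items()}
    return {ld: min(t_fp, key=lambda lt: np.linalg.norm(fd - t_fp[lt]))
            for ld, fd in d_fp.items()}

def recipe_coefficients(refusal_dir, donor_atoms, lam=1e-3):
    """Express the donor's refusal direction as a ridge-regularized
    combination of donor concept atoms (rows of donor_atoms). The
    coefficient vector is the model-agnostic 'recipe'."""
    gram = donor_atoms @ donor_atoms.T + lam * np.eye(donor_atoms.shape[0])
    return np.linalg.solve(gram, donor_atoms @ refusal_dir)

def rebuild_direction(coef, target_atoms):
    """Reconstruct the refusal direction in the target's own semantic
    space from the shared recipe and the target's concept atoms."""
    r = target_atoms.T @ coef          # (d_target,)
    return r / np.linalg.norm(r)

def svd_stability_guard(direction, W, k=8):
    """Project the intervention away from the top-k right singular
    subspace of a target weight matrix W, so the ablation avoids the
    high-variance directions that carry general capabilities."""
    _, _, Vt = np.linalg.svd(W, full_matrices=False)
    V = Vt[:k].T                       # (d_target, k) high-variance basis
    guarded = direction - V @ (V.T @ direction)
    n = np.linalg.norm(guarded)
    return guarded / n if n > 0 else direction

def ablate(h, r):
    """Directional ablation: remove the refusal component r (unit norm)
    from activations h of shape (n_tokens, d_target)."""
    return h - np.outer(h @ r, r)
```

Note that the only quantity crossing the model boundary in this sketch is the recipe coefficient vector; because it is expressed over the shared concept bank rather than raw hidden coordinates, no target-side refusal supervision is required, which matches the transfer setting the abstract describes.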
Related papers
- BadRSSD: Backdoor Attacks on Regularized Self-Supervised Diffusion Models [10.286339414754499]
BadRSSD is the first backdoor attack targeting the representation layer of self-supervised diffusion models. It hijacks the semantic representations of poisoned samples with triggers in PCA space toward those of a target image. BadRSSD substantially outperforms existing attacks in both FID and MSE metrics.
arXiv Detail & Related papers (2026-03-01T09:56:26Z)
- Differential Vector Erasure: Unified Training-Free Concept Erasure for Flow Matching Models [49.10620605347065]
We propose Differential Vector Erasure (DVE), a training-free concept erasure method specifically designed for flow matching models. Our key insight is that semantic concepts are implicitly encoded in the directional structure of the velocity field governing the generative flow. During inference, DVE selectively removes concept-specific components by projecting the velocity field onto the differential direction, enabling precise concept suppression without affecting irrelevant semantics.
arXiv Detail & Related papers (2026-02-01T08:05:45Z)
- Rethinking Transferable Adversarial Attacks on Point Clouds from a Compact Subspace Perspective [55.919842734983156]
CoSA is a transferable attack framework that operates within a shared low-dimensional semantic space. CoSA consistently outperforms state-of-the-art transferable attacks.
arXiv Detail & Related papers (2026-01-30T15:48:11Z)
- LURE: Latent Space Unblocking for Multi-Concept Reawakening in Diffusion Models [24.332916173317113]
Concept erasure aims to suppress sensitive content in diffusion models. Recent studies show that erased concepts can still be reawakened, revealing vulnerabilities in erasure methods. We model the generation process as an implicit function to enable a comprehensive theoretical analysis of multiple factors.
arXiv Detail & Related papers (2026-01-20T10:39:11Z)
- Sparse Concept Anchoring for Interpretable and Controllable Neural Representations [0.9831489366502301]
We introduce Sparse Concept Anchoring, a method that biases the latent space to position a targeted subset of concepts. The anchored geometry enables two practical interventions: behavioral steering that projects out a concept's latent component at inference, and permanent removal.
arXiv Detail & Related papers (2025-12-13T21:43:17Z)
- Keep Calm and Avoid Harmful Content: Concept Alignment and Latent Manipulation Towards Safer Answers [0.0]
Large Language Models are susceptible to jailbreak attacks that bypass built-in safety guardrails. We propose Concept Alignment and Latent Manipulation (CALM), an inference-time method that suppresses harmful concepts by modifying latent representations.
arXiv Detail & Related papers (2025-10-14T16:08:22Z)
- Revoking Amnesia: RL-based Trajectory Optimization to Resurrect Erased Concepts in Diffusion Models [38.38751366738881]
Concept erasure techniques have been widely deployed in T2I diffusion models to prevent inappropriate content generation for safety and copyright considerations. However, established erasure methods exhibit degraded effectiveness, raising questions about their true mechanisms. We propose RevAm, a trajectory optimization framework that resurrects erased concepts by dynamically steering the denoising process.
arXiv Detail & Related papers (2025-09-30T07:46:19Z)
- AdvChain: Adversarial Chain-of-Thought Tuning for Robust Safety Alignment of Large Reasoning Models [62.70575022567081]
We propose AdvChain, an alignment paradigm that teaches models dynamic self-correction through adversarial CoT tuning. Our work establishes a new direction for building more robust and reliable reasoning models.
arXiv Detail & Related papers (2025-09-29T04:27:23Z)
- Interpretable Few-Shot Image Classification via Prototypical Concept-Guided Mixture of LoRA Experts [79.18608192761512]
Self-Explainable Models (SEMs) rely on Prototypical Concept Learning (PCL) to make their visual recognition processes more interpretable. We propose a Few-Shot Prototypical Concept Classification framework that mitigates two key challenges under low-data regimes: parametric imbalance and representation misalignment. Our approach consistently outperforms existing SEMs by a notable margin, with 4.2%-8.7% relative gains in 5-way 5-shot classification.
arXiv Detail & Related papers (2025-06-05T06:39:43Z)
- Robust Anti-Backdoor Instruction Tuning in LVLMs [53.766434746801366]
We introduce a lightweight, certified-agnostic defense framework for large visual language models (LVLMs). Our framework fine-tunes only adapter modules and text embedding layers under instruction tuning. Experiments against seven attacks on Flickr30k and MSCOCO demonstrate that our method reduces the attack success rate to nearly zero.
arXiv Detail & Related papers (2025-06-04T01:23:35Z)
- REFINE: Inversion-Free Backdoor Defense via Model Reprogramming [60.554146386198376]
Backdoor attacks on deep neural networks (DNNs) have emerged as a significant security threat. We propose REFINE, an inversion-free backdoor defense method based on model reprogramming.
arXiv Detail & Related papers (2025-02-22T07:29:12Z) - Transferable Adversarial Attacks on SAM and Its Downstream Models [87.23908485521439]
This paper explores the feasibility of adversarial attacks on various downstream models fine-tuned from the segment anything model (SAM). To enhance the effectiveness of the adversarial attack towards models fine-tuned on unknown datasets, we propose a universal meta-initialization (UMI) algorithm.
arXiv Detail & Related papers (2024-10-26T15:04:04Z) - Backpropagation Path Search On Adversarial Transferability [35.71353415348786]
Transfer-based attackers craft adversarial examples against surrogate models and transfer them to victim models.
Structure-based attackers adjust the backpropagation path to avoid the attack from overfitting the surrogate model.
Existing structure-based attackers fail to explore the convolution module in CNNs and modify the backpropagation graph.
arXiv Detail & Related papers (2023-08-15T08:21:20Z) - A Self-supervised Approach for Adversarial Robustness [105.88250594033053]
Adversarial examples can cause catastrophic mistakes in Deep Neural Network (DNN)-based vision systems.
This paper proposes a self-supervised adversarial training mechanism in the input space.
It provides significant robustness against unseen adversarial attacks.
arXiv Detail & Related papers (2020-06-08T20:42:39Z)