Universal Refusal Circuits Across LLMs: Cross-Model Transfer via Trajectory Replay and Concept-Basis Reconstruction
- URL: http://arxiv.org/abs/2601.16034v2
- Date: Sun, 25 Jan 2026 22:07:21 GMT
- Title: Universal Refusal Circuits Across LLMs: Cross-Model Transfer via Trajectory Replay and Concept-Basis Reconstruction
- Authors: Tony Cristofano
- Abstract summary: We introduce Trajectory Replay via Concept-Basis Reconstruction, a framework that transfers refusal interventions from donor to target models. By aligning layers via concept fingerprints and reconstructing refusal directions using a shared "recipe" of concept atoms, we map the donor's ablation trajectory into the target's semantic space. Our evaluation confirms that these transferred recipes consistently attenuate refusal while maintaining performance.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Refusal behavior in aligned LLMs is often viewed as model-specific, yet we hypothesize it stems from a universal, low-dimensional semantic circuit shared across models. To test this, we introduce Trajectory Replay via Concept-Basis Reconstruction, a framework that transfers refusal interventions from donor to target models, spanning diverse architectures (e.g., Dense to MoE) and training regimes, without using target-side refusal supervision. By aligning layers via concept fingerprints and reconstructing refusal directions using a shared "recipe" of concept atoms, we map the donor's ablation trajectory into the target's semantic space. To preserve capabilities, we introduce a weight-SVD stability guard that projects interventions away from high-variance weight subspaces to prevent collateral damage. Our evaluation across 8 model pairs confirms that these transferred recipes consistently attenuate refusal while maintaining performance, providing strong evidence for the semantic universality of safety alignment.
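The abstract outlines a three-stage pipeline: align donor and target layers via concept fingerprints, reconstruct the donor's refusal direction in the target from a shared recipe over concept atoms, and guard the resulting intervention with a weight-SVD projection. As a reading aid, here is a minimal NumPy sketch of that pipeline; it assumes mean layer activations over a shared concept-prompt bank are available for both models, and every function name, shape convention, and hyperparameter below is an illustrative assumption, not the authors' implementation.

```python
# Minimal sketch of the transfer pipeline described in the abstract.
# Assumed inputs: per-layer mean activations on a shared concept bank
# for donor and target. All names and hyperparameters are hypothetical.
import numpy as np

def concept_fingerprint(acts):
    """acts: (n_concepts, d) mean activations at one layer on the shared
    concept bank. The fingerprint is the concept-concept cosine-similarity
    matrix, comparable across models of different hidden width."""
    a = acts / np.linalg.norm(acts, axis=1, keepdims=True)
    return a @ a.T

def align_layers(donor_acts, target_acts):
    """Map each donor layer to the target layer with the closest
    fingerprint (Frobenius distance); this gives the correspondence
    along which the donor's ablation trajectory is replayed."""
    d_fp = {l: concept_fingerprint(a) for l, a in donor_acts.items()}
    t_fp = {l: concept_fingerprint(a) for l, a in target_acts.items()}
    return {ld: min(t_fp, key=lambda lt: np.linalg.norm(fd - t_fp[lt]))
            for ld, fd in d_fp.items()}

def recipe_coefficients(refusal_dir, donor_atoms, lam=1e-3):
    """Express the donor's refusal direction as a ridge-regularized
    combination of donor concept atoms (rows of donor_atoms). The
    coefficient vector is the model-agnostic 'recipe'."""
    gram = donor_atoms @ donor_atoms.T + lam * np.eye(donor_atoms.shape[0])
    return np.linalg.solve(gram, donor_atoms @ refusal_dir)

def rebuild_direction(coef, target_atoms):
    """Reconstruct the refusal direction in the target's own semantic
    space from the shared recipe and the target's concept atoms."""
    r = target_atoms.T @ coef          # (d_target,)
    return r / np.linalg.norm(r)

def svd_stability_guard(direction, W, k=8):
    """Project the intervention away from the top-k right singular
    subspace of a target weight matrix W, so the ablation avoids the
    high-variance directions that carry general capabilities."""
    _, _, Vt = np.linalg.svd(W, full_matrices=False)
    V = Vt[:k].T                       # (d_target, k) high-variance basis
    guarded = direction - V @ (V.T @ direction)
    n = np.linalg.norm(guarded)
    return guarded / n if n > 0 else direction

def ablate(h, r):
    """Directional ablation: remove the refusal component r (unit norm)
    from activations h of shape (n_tokens, d_target)."""
    return h - np.outer(h @ r, r)
```

Note that the only quantity crossing the model boundary in this sketch is the recipe coefficient vector; because it is expressed over the shared concept bank rather than raw hidden coordinates, no target-side refusal supervision is required, which matches the transfer setting the abstract describes.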
Related papers
- BadRSSD: Backdoor Attacks on Regularized Self-Supervised Diffusion Models [10.286339414754499]
BadRSSD is the first backdoor attack targeting the representation layer of self-supervised diffusion models. It hijacks the semantic representations of poisoned samples with triggers in PCA space toward those of a target image. BadRSSD substantially outperforms existing attacks in both FID and MSE metrics.
arXiv Detail & Related papers (2026-03-01T09:56:26Z)
- Differential Vector Erasure: Unified Training-Free Concept Erasure for Flow Matching Models [49.10620605347065]
We propose Differential Vector Erasure (DVE), a training-free concept erasure method specifically designed for flow matching models. Our key insight is that semantic concepts are implicitly encoded in the directional structure of the velocity field governing the generative flow. During inference, DVE selectively removes concept-specific components by projecting the velocity field onto the differential direction, enabling precise concept suppression without affecting irrelevant semantics.
arXiv Detail & Related papers (2026-02-01T08:05:45Z)
- Rethinking Transferable Adversarial Attacks on Point Clouds from a Compact Subspace Perspective [55.919842734983156]
CoSA is a transferable attack framework that operates within a shared low-dimensional semantic space. CoSA consistently outperforms state-of-the-art transferable attacks.
arXiv Detail & Related papers (2026-01-30T15:48:11Z)
- LURE: Latent Space Unblocking for Multi-Concept Reawakening in Diffusion Models [24.332916173317113]
Concept erasure aims to suppress sensitive content in diffusion models. Recent studies show that erased concepts can still be reawakened, revealing vulnerabilities in erasure methods. We model the generation process as an implicit function to enable a comprehensive theoretical analysis of multiple factors.
arXiv Detail & Related papers (2026-01-20T10:39:11Z)
- Sparse Concept Anchoring for Interpretable and Controllable Neural Representations [0.9831489366502301]
We introduce Sparse Concept Anchoring, a method that biases the latent space to position a targeted subset of concepts. The anchored geometry enables two practical interventions: behavioral steering that projects out a concept's latent component at inference, and permanent removal.
arXiv Detail & Related papers (2025-12-13T21:43:17Z)
- Keep Calm and Avoid Harmful Content: Concept Alignment and Latent Manipulation Towards Safer Answers [0.0]
Large Language Models are susceptible to jailbreak attacks that bypass built-in safety guardrails. We propose Concept Alignment and Latent Manipulation (CALM), an inference-time method that suppresses harmful concepts by modifying latent representations.
arXiv Detail & Related papers (2025-10-14T16:08:22Z)
- Revoking Amnesia: RL-based Trajectory Optimization to Resurrect Erased Concepts in Diffusion Models [38.38751366738881]
Concept erasure techniques have been widely deployed in T2I diffusion models to prevent inappropriate content generation for safety and copyright considerations. However, established erasure methods exhibit degraded effectiveness, raising questions about their true mechanisms. We propose RevAm, a trajectory optimization framework that resurrects erased concepts by dynamically steering the denoising process.
arXiv Detail & Related papers (2025-09-30T07:46:19Z)
- AdvChain: Adversarial Chain-of-Thought Tuning for Robust Safety Alignment of Large Reasoning Models [62.70575022567081]
We propose AdvChain, an alignment paradigm that teaches models dynamic self-correction through adversarial CoT tuning. Our work establishes a new direction for building more robust and reliable reasoning models.
arXiv Detail & Related papers (2025-09-29T04:27:23Z)
- Interpretable Few-Shot Image Classification via Prototypical Concept-Guided Mixture of LoRA Experts [79.18608192761512]
Self-Explainable Models (SEMs) rely on Prototypical Concept Learning (PCL) to make their visual recognition processes more interpretable. We propose a Few-Shot Prototypical Concept Classification framework that mitigates two key challenges under low-data regimes: parametric imbalance and representation misalignment. Our approach consistently outperforms existing SEMs by a notable margin, with 4.2%-8.7% relative gains in 5-way 5-shot classification.
arXiv Detail & Related papers (2025-06-05T06:39:43Z)
- Robust Anti-Backdoor Instruction Tuning in LVLMs [53.766434746801366]
We introduce a lightweight, certified-agnostic defense framework for large visual language models (LVLMs). Our framework fine-tunes only adapter modules and text embedding layers under instruction tuning. Experiments against seven attacks on Flickr30k and MSCOCO demonstrate that our method reduces the attack success rate to nearly zero.
arXiv Detail & Related papers (2025-06-04T01:23:35Z)
- REFINE: Inversion-Free Backdoor Defense via Model Reprogramming [60.554146386198376]
Backdoor attacks on deep neural networks (DNNs) have emerged as a significant security threat. We propose REFINE, an inversion-free backdoor defense method based on model reprogramming.
arXiv Detail & Related papers (2025-02-22T07:29:12Z) - Transferable Adversarial Attacks on SAM and Its Downstream Models [87.23908485521439]
This paper explores the feasibility of adversarial attacks on various downstream models fine-tuned from the segment anything model (SAM). To enhance the effectiveness of the adversarial attack towards models fine-tuned on unknown datasets, we propose a universal meta-initialization (UMI) algorithm.
arXiv Detail & Related papers (2024-10-26T15:04:04Z) - Backpropagation Path Search On Adversarial Transferability [35.71353415348786]
Transfer-based attackers craft adversarial examples against surrogate models and transfer them to victim models.
Structure-based attackers adjust the backpropagation path to avoid the attack from overfitting the surrogate model.
Existing structure-based attackers fail to explore the convolution module in CNNs and modify the backpropagation graph.
arXiv Detail & Related papers (2023-08-15T08:21:20Z) - A Self-supervised Approach for Adversarial Robustness [105.88250594033053]
Adversarial examples can cause catastrophic mistakes in Deep Neural Network (DNN)-based vision systems.
This paper proposes a self-supervised adversarial training mechanism in the input space.
It provides significant robustness against unseen adversarial attacks.
arXiv Detail & Related papers (2020-06-08T20:42:39Z)