Steering in the Shadows: Causal Amplification for Activation Space Attacks in Large Language Models
- URL: http://arxiv.org/abs/2511.17194v1
- Date: Fri, 21 Nov 2025 12:19:55 GMT
- Title: Steering in the Shadows: Causal Amplification for Activation Space Attacks in Large Language Models
- Authors: Zhiyuan Xu, Stanislav Abaimov, Joseph Gardiner, Sana Belguith
- Abstract summary: We show that intermediate activations in decoder-only large language models (LLMs) form a vulnerable attack surface for behavioral control. We exploit this surface via Sensitivity-Scaled Steering (SSS), a progressive activation-level attack. We show that SSS induces large shifts in evil, hallucination, sycophancy, and sentiment while preserving high coherence and general capabilities.
- Score: 8.92145245069646
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern large language models (LLMs) are typically secured by auditing data, prompts, and refusal policies, while treating the forward pass as an implementation detail. We show that intermediate activations in decoder-only LLMs form a vulnerable attack surface for behavioral control. Building on recent findings on attention sinks and compression valleys, we identify a high-gain region in the residual stream where small, well-aligned perturbations are causally amplified along the autoregressive trajectory--a Causal Amplification Effect (CAE). We exploit this as an attack surface via Sensitivity-Scaled Steering (SSS), a progressive activation-level attack that combines beginning-of-sequence (BOS) anchoring with sensitivity-based reinforcement to focus a limited perturbation budget on the most vulnerable layers and tokens. We show that across multiple open-weight models and four behavioral axes, SSS induces large shifts in evil, hallucination, sycophancy, and sentiment while preserving high coherence and general capabilities, turning activation steering into a concrete security concern for white-box and supply-chain LLM deployments.
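The attack surface described in the abstract can be illustrated with a minimal activation-steering sketch. The snippet below is an assumption-laden approximation, not the authors' implementation: it adds a scaled steering vector to the residual stream of selected decoder layers via forward hooks, with hypothetical layer indices and sensitivity weights standing in for the sensitivity-based budget allocation that SSS performs (BOS anchoring and progressive reinforcement are omitted).

```python
# Minimal sketch of activation-level steering (not the paper's code).
# The steering direction, target layers, and sensitivity weights are placeholders.
import torch

def make_steering_hook(direction: torch.Tensor, scale: float):
    """Return a forward hook that adds `scale * direction` to the residual stream."""
    direction = direction / direction.norm()          # use a unit steering direction

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * direction.to(hidden.dtype)   # perturb every token's state
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    return hook

# Hypothetical usage with a Hugging Face-style decoder (e.g. model.model.layers):
# concentrate a fixed perturbation budget on the layers estimated to be most sensitive.
# sensitivity = {12: 0.9, 13: 1.0, 14: 0.6}          # e.g. from gradient-based probing
# total = sum(sensitivity.values())
# handles = [
#     model.model.layers[i].register_forward_hook(
#         make_steering_hook(steer_vec, budget * s / total)
#     )
#     for i, s in sensitivity.items()
# ]
```

On this reading, the "high-gain region" corresponds to the layers that receive the largest share of the budget: small perturbations injected there are carried forward and amplified as autoregressive generation proceeds.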
Related papers
- PA-Attack: Guiding Gray-Box Attacks on LVLM Vision Encoders with Prototypes and Attention [63.63231191403825]
Large Vision-Language Models (LVLMs) are foundational to modern multimodal applications, yet their susceptibility to adversarial attacks remains a critical concern. We introduce PA-Attack (Prototype-Anchored Attentive Attack) to tackle the attribute-restricted issue and limited task generalization of vanilla attacks. Experiments show that PA-Attack achieves an average 75.1% score reduction rate (SRR), demonstrating strong attack effectiveness, efficiency, and task generalization in LVLMs.
arXiv Detail & Related papers (2026-02-23T01:20:43Z)
- Visual Backdoor Attacks on MLLM Embodied Decision Making via Contrastive Trigger Learning [89.1856483797116]
We introduce BEAT, the first framework to inject visual backdoors into MLLM-based embodied agents. Unlike textual triggers, object triggers exhibit wide variation across viewpoints and lighting, making them difficult to implant reliably. BEAT achieves attack success rates up to 80%, while maintaining strong benign task performance.
arXiv Detail & Related papers (2025-10-31T16:50:49Z)
- FORCE: Transferable Visual Jailbreaking Attacks via Feature Over-Reliance CorrEction [82.6826848085638]
Visual jailbreaking attacks can manipulate open-source MLLMs more readily than sophisticated textual attacks. However, these attacks exhibit extremely limited cross-model transferability, failing to reliably identify vulnerabilities in closed-source MLLMs. We propose a Feature Over-Reliance CorrEction (FORCE) method, which guides the attack to explore broader feasible regions.
arXiv Detail & Related papers (2025-09-25T11:36:56Z)
- Embedding Poisoning: Bypassing Safety Alignment via Embedding Semantic Shift [23.0914017433021]
This work identifies a novel class of deployment-phase attacks that inject imperceptible perturbations directly into the embedding layer outputs, without modifying model weights or input text. We propose Search-based Embedding Poisoning, a practical, model-agnostic framework that introduces carefully optimized perturbations into embeddings associated with high-risk tokens.
arXiv Detail & Related papers (2025-09-08T05:00:58Z)
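For contrast with residual-stream steering, the embedding-level attack described in the entry above can be approximated by a simple local search over perturbations of the embedding-layer outputs. This is a hedged sketch under assumed names (`attack_score`, the perturbed token positions), not the Search-based Embedding Poisoning algorithm itself.

```python
# Rough sketch: local random search for a small perturbation on selected
# token embeddings; the attacker's scoring function is a hypothetical callback.
import torch

def poison_embeddings(embeds, positions, attack_score, steps=200, eps=0.05):
    """Search for a small perturbation on the embeddings at `positions`."""
    base = embeds.detach().clone()
    best_delta = torch.zeros_like(base[:, positions, :])
    best_score = attack_score(base)
    for _ in range(steps):
        delta = best_delta + eps * torch.randn_like(best_delta)   # propose a nearby perturbation
        cand = base.clone()
        cand[:, positions, :] += delta                            # perturb only high-risk tokens
        score = attack_score(cand)
        if score > best_score:                                    # keep improvements
            best_delta, best_score = delta, score
    poisoned = base.clone()
    poisoned[:, positions, :] += best_delta
    return poisoned

# The poisoned embeddings would then be fed to the model via
# model(inputs_embeds=poisoned), leaving weights and input text untouched.
```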
- Probing the Robustness of Large Language Models Safety to Latent Perturbations [30.16804362984161]
Safety alignment is a key requirement for building reliable Artificial General Intelligence. We observe that minor latent shifts can still trigger unsafe responses in aligned models. We introduce Layer-wise Adversarial Patch Training (LAPT), a fine-tuning strategy that injects controlled perturbations into hidden representations during training.
arXiv Detail & Related papers (2025-06-19T07:03:05Z)
- Representation Bending for Large Language Model Safety [27.842146980762934]
Large Language Models (LLMs) have emerged as powerful tools, but their inherent safety risks pose significant challenges. This paper introduces RepBend, a novel approach that fundamentally disrupts the representations underlying harmful behaviors in LLMs. RepBend achieves state-of-the-art performance, outperforming prior methods such as Circuit Breaker, RMU, and NPO, with up to 95% reduction in attack success rates.
arXiv Detail & Related papers (2025-04-02T09:47:01Z)
- Black-Box Adversarial Attack on Vision Language Models for Autonomous Driving [65.61999354218628]
We take the first step toward designing black-box adversarial attacks specifically targeting vision-language models (VLMs) in autonomous driving systems. We propose Cascading Adversarial Disruption (CAD), which targets low-level reasoning breakdown by generating and injecting semantics. We present Risky Scene Induction, which addresses dynamic adaptation by leveraging a surrogate VLM to understand and construct high-level risky scenarios.
arXiv Detail & Related papers (2025-01-23T11:10:02Z)
- Spot Risks Before Speaking! Unraveling Safety Attention Heads in Large Vision-Language Models [9.318094073527563]
Internal activations of large vision-language models (LVLMs) can identify malicious prompts across different attacks. This inherent safety perception is governed by sparse attention heads, which we term "safety heads". By locating these safety heads and concatenating their activations, we construct a straightforward but powerful malicious prompt detector.
arXiv Detail & Related papers (2025-01-03T07:01:15Z)
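The detector construction mentioned in the entry above, concatenating the activations of selected "safety heads" and classifying, can be sketched as follows; the (layer, head) pairs, the feature extraction, and the use of logistic regression are illustrative assumptions rather than the paper's exact pipeline.

```python
# Illustrative safety-head detector: concatenate last-token activations of a few
# chosen attention heads and fit a simple classifier on labeled prompts.
import numpy as np
from sklearn.linear_model import LogisticRegression

SAFETY_HEADS = [(10, 3), (14, 7), (15, 1)]           # hypothetical (layer, head) pairs

def head_features(per_head_acts):
    """Concatenate the last-token activation of each selected head into one feature vector."""
    return np.concatenate([per_head_acts[lh][-1] for lh in SAFETY_HEADS])

# X: one feature vector per labeled prompt, y: 1 = malicious, 0 = benign
# X = np.stack([head_features(acts) for acts in prompt_activations])
# detector = LogisticRegression(max_iter=1000).fit(X, y)
# flag = detector.predict(head_features(new_prompt_acts)[None, :])[0]
```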
- Multi-granular Adversarial Attacks against Black-box Neural Ranking Models [111.58315434849047]
We create high-quality adversarial examples by incorporating multi-granular perturbations.
We transform the multi-granular attack into a sequential decision-making process.
Our attack method surpasses prevailing baselines in both attack effectiveness and imperceptibility.
arXiv Detail & Related papers (2024-04-02T02:08:29Z)
- Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment [31.24530091590395]
We study an attack scenario called Trojan Activation Attack (TA2), which injects trojan steering vectors into the activation layers of Large Language Models.
Our experiment results show that TA2 is highly effective and adds little or no overhead to attack efficiency.
arXiv Detail & Related papers (2023-11-15T23:07:40Z)
- Visual Adversarial Examples Jailbreak Aligned Large Language Models [66.53468356460365]
We show that the continuous and high-dimensional nature of the visual input makes it a weak link against adversarial attacks.
We exploit visual adversarial examples to circumvent the safety guardrail of aligned LLMs with integrated vision.
Our study underscores the escalating adversarial risks associated with the pursuit of multimodality.
arXiv Detail & Related papers (2023-06-22T22:13:03Z)
- Removing Adversarial Noise in Class Activation Feature Space [160.78488162713498]
We propose to remove adversarial noise by implementing a self-supervised adversarial training mechanism in a class activation feature space.
We train a denoising model to minimize the distances between the adversarial examples and the natural examples in the class activation feature space.
Empirical evaluations demonstrate that our method could significantly enhance adversarial robustness in comparison to previous state-of-the-art approaches.
arXiv Detail & Related papers (2021-04-19T10:42:24Z)
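The defense in the last entry, denoising in class activation feature space, boils down to a feature-matching objective: train a denoiser so that the class activation features of denoised adversarial examples match those of the corresponding natural examples. The sketch below paraphrases that objective under assumed names (`denoiser`, `feature_extractor`); it is not the authors' implementation.

```python
# Sketch of a class-activation-feature-space denoising loss.
import torch.nn.functional as F

def cafd_loss(denoiser, feature_extractor, x_adv, x_clean):
    """Distance between clean and denoised-adversarial inputs in class activation feature space."""
    target = feature_extractor(x_clean).detach()       # clean-example features, kept fixed
    pred = feature_extractor(denoiser(x_adv))          # features of the denoised adversarial input
    return F.mse_loss(pred, target)

# Typical training step (names are placeholders):
# loss = cafd_loss(denoiser, cam_features, x_adv, x_clean)
# loss.backward(); optimizer.step()
```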
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.