LatentGuard: Controllable Latent Steering for Robust Refusal of Attacks and Reliable Response Generation
- URL: http://arxiv.org/abs/2509.19839v1
- Date: Wed, 24 Sep 2025 07:31:54 GMT
- Title: LatentGuard: Controllable Latent Steering for Robust Refusal of Attacks and Reliable Response Generation
- Authors: Huizhen Shu, Xuying Li, Zhuo Li,
- Abstract summary: LATENTGUARD is a framework that combines behavioral alignment with supervised latent space control for interpretable and precise safety steering.<n>Our results show significant improvements in both safety controllability and response interpretability without compromising utility.
- Score: 4.29885665563186
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Achieving robust safety alignment in large language models (LLMs) while preserving their utility remains a fundamental challenge. Existing approaches often struggle to balance comprehensive safety with fine-grained controllability at the representation level. We introduce LATENTGUARD, a novel three-stage framework that combines behavioral alignment with supervised latent space control for interpretable and precise safety steering. Our approach begins by fine-tuning an LLM on rationalized datasets containing both reasoning-enhanced refusal responses to adversarial prompts and reasoning-enhanced normal responses to benign queries, establishing robust behavioral priors across both safety-critical and utility-preserving scenarios. We then train a structured variational autoencoder (VAE) on intermediate MLP activations, supervised by multi-label annotations including attack types, attack methods, and benign indicators. This supervision enables the VAE to learn disentangled latent representations that capture distinct adversarial characteristics while maintaining semantic interpretability. Through targeted manipulation of learned latent dimensions, LATENTGUARD achieves selective refusal behavior, effectively blocking harmful requests while preserving helpfulness for legitimate use cases. Experiments on Qwen3-8B demonstrate significant improvements in both safety controllability and response interpretability without compromising utility. Cross-architecture validation on Mistral-7B confirms the generalizability of our latent steering approach, showing consistent effectiveness across different model families. Our results suggest that structured representation-level intervention offers a promising pathway toward building safer yet practical LLM systems.
Related papers
- TraceGuard: Process-Guided Firewall against Reasoning Backdoors in Large Language Models [19.148124494194317]
We propose TraceGuard, a process-guided security framework that transforms small-scale models into robust reasoning firewalls.<n>Our approach treats the reasoning trace as an untrusted payload and establishes a defense-in-depth strategy.<n>We demonstrate robustness against adaptive adversaries in a grey-box setting, establishing TraceGuard as a viable, low-latency security primitive.
arXiv Detail & Related papers (2026-03-02T22:19:13Z) - SAGE-LLM: Towards Safe and Generalizable LLM Controller with Fuzzy-CBF Verification and Graph-Structured Knowledge Retrieval for UAV Decision [46.089736018739295]
Large Language Models (LLM) lack domain-specific UAV control knowledge and formal safety assurances.<n>This paper proposes a train-free two-layer decision architecture based on LLMs, integrating high-level safety planning with low-level precise control.
arXiv Detail & Related papers (2026-02-27T06:41:04Z) - BarrierSteer: LLM Safety via Learning Barrier Steering [83.12893815611052]
BarrierSteer is a novel framework that formalizes safety by embedding learned non-linear safety constraints directly into the model's latent representation space.<n>We show that BarrierSteer substantially reduces adversarial success rates, decreases unsafe generations, and outperforms existing methods.
arXiv Detail & Related papers (2026-02-23T18:19:46Z) - Self-Guard: Defending Large Reasoning Models via enhanced self-reflection [54.775612141528164]
Self-Guard is a lightweight safety defense framework for Large Reasoning Models.<n>It bridges the awareness-compliance gap, achieving robust safety performance without compromising model utility.<n>Self-Guard exhibits strong generalization across diverse unseen risks and varying model scales.
arXiv Detail & Related papers (2026-01-31T13:06:11Z) - From Passive Metric to Active Signal: The Evolving Role of Uncertainty Quantification in Large Language Models [77.04403907729738]
This survey charts the evolution of uncertainty from a passive diagnostic metric to an active control signal guiding real-time model behavior.<n>We demonstrate how uncertainty is leveraged as an active control signal across three frontiers.<n>This survey argues that mastering the new trend of uncertainty is essential for building the next generation of scalable, reliable, and trustworthy AI.
arXiv Detail & Related papers (2026-01-22T06:21:31Z) - Reasoning over Precedents Alongside Statutes: Case-Augmented Deliberative Alignment for LLM Safety [59.01189713115365]
We evaluate the impact of explicitly specifying extensive safety codes versus demonstrating them through illustrative cases.<n>We find that referencing explicit codes inconsistently improves harmlessness and systematically degrades helpfulness.<n>We propose CADA, a case-augmented deliberative alignment method for LLMs utilizing reinforcement learning on self-generated safety reasoning chains.
arXiv Detail & Related papers (2026-01-12T21:08:46Z) - Embedding Poisoning: Bypassing Safety Alignment via Embedding Semantic Shift [23.0914017433021]
This work identifies a novel class of deployment phase attacks that exploit a vulnerability by injecting imperceptible perturbations directly into the embedding layer outputs without modifying model weights or input text.<n>We propose Search based Embedding Poisoning, a practical, model agnostic framework that introduces carefully optimized perturbations into embeddings associated with high risk tokens.
arXiv Detail & Related papers (2025-09-08T05:00:58Z) - Rethinking Safety in LLM Fine-tuning: An Optimization Perspective [56.31306558218838]
We show that poor optimization choices, rather than inherent trade-offs, often cause safety problems, measured as harmful responses to adversarial prompts.<n>We propose a simple exponential moving average (EMA) momentum technique in parameter space that preserves safety performance.<n>Our experiments on the Llama families across multiple datasets demonstrate that safety problems can largely be avoided without specialized interventions.
arXiv Detail & Related papers (2025-08-17T23:46:36Z) - ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning [49.47193675702453]
Large Language Models (LLMs) have demonstrated remarkable generative capabilities.<n>LLMs remain vulnerable to malicious instructions that can bypass safety constraints.<n>We propose a reasoning-based safety alignment framework, ARMOR, that replaces the ad-hoc chains of thought reasoning process with human-aligned, structured one.
arXiv Detail & Related papers (2025-07-14T09:05:54Z) - Probing the Robustness of Large Language Models Safety to Latent Perturbations [30.16804362984161]
Safety alignment is a key requirement for building reliable Artificial General Intelligence.<n>We observe that minor latent shifts can still trigger unsafe responses in aligned models.<n>We introduce Layer-wise Adversarial Patch Training(LAPT), a fine-tuning strategy that injects controlled perturbations into hidden representations during training.
arXiv Detail & Related papers (2025-06-19T07:03:05Z) - SafeSteer: Interpretable Safety Steering with Refusal-Evasion in LLMs [7.120986296945107]
This paper investigates an approach called SafeSteer for guiding the outputs of large language models (LLMs)<n>We employ a simple, gradient-free unsupervised method to enhance safety steering while preserving text quality, topic relevance, and without explicit refusal.
arXiv Detail & Related papers (2025-06-01T01:19:37Z) - Safety Alignment Can Be Not Superficial With Explicit Safety Signals [8.297367440457508]
Recent studies on the safety alignment of large language models (LLMs) have revealed that existing approaches often operate superficially.<n>This paper identifies a fundamental cause of this superficiality: existing alignment approaches presume that models can implicitly learn a safety-related reasoning task during the alignment process.<n>By explicitly introducing a safety-related binary classification task and integrating its signals with our attention and decoding strategies, we eliminate this ambiguity.
arXiv Detail & Related papers (2025-05-19T20:40:46Z) - Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking [54.10710423370126]
We propose Reasoning-to-Defend (R2D), a training paradigm that integrates a safety-aware reasoning mechanism into Large Language Models' generation process.<n>CPO enhances the model's perception of the safety status of given dialogues.<n>Experiments demonstrate that R2D effectively mitigates various attacks and improves overall safety, while maintaining the original performances.
arXiv Detail & Related papers (2025-02-18T15:48:46Z) - SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering [56.92068213969036]
Safety alignment is indispensable for Large Language Models (LLMs) to defend threats from malicious instructions.<n>Recent researches reveal safety-aligned LLMs prone to reject benign queries due to the exaggerated safety issue.<n>We propose a Safety-Conscious Activation Steering (SCANS) method to mitigate the exaggerated safety concerns.
arXiv Detail & Related papers (2024-08-21T10:01:34Z) - On Prompt-Driven Safeguarding for Large Language Models [172.13943777203377]
We find that in the representation space, the input queries are typically moved by safety prompts in a "higher-refusal" direction.
Inspired by these findings, we propose a method for safety prompt optimization, namely DRO.
Treating a safety prompt as continuous, trainable embeddings, DRO learns to move the queries' representations along or opposite the refusal direction, depending on their harmfulness.
arXiv Detail & Related papers (2024-01-31T17:28:24Z) - Scalable Synthesis of Verified Controllers in Deep Reinforcement
Learning [0.0]
We propose an automated verification pipeline capable of synthesizing high-quality safety shields.
Our key insight involves separating safety verification from neural controller, using pre-computed verified safety shields to constrain neural controller training.
Experimental results over a range of realistic high-dimensional deep RL benchmarks demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2021-04-20T19:30:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.