Preemptive Detection and Steering of LLM Misalignment via Latent Reachability
- URL: http://arxiv.org/abs/2509.21528v1
- Date: Thu, 25 Sep 2025 20:15:29 GMT
- Title: Preemptive Detection and Steering of LLM Misalignment via Latent Reachability
- Authors: Sathwik Karnik, Somil Bansal,
- Abstract summary: Large language models (LLMs) are now ubiquitous in everyday tools, raising urgent safety concerns about their tendency to generate harmful content.<n>We propose BRT-Align, a reachability-based framework that brings control-theoretic safety tools to LLM inference.
- Score: 8.01833277608166
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are now ubiquitous in everyday tools, raising urgent safety concerns about their tendency to generate harmful content. The dominant safety approach -- reinforcement learning from human feedback (RLHF) -- effectively shapes model behavior during training but offers no safeguards at inference time, where unsafe continuations may still arise. We propose BRT-Align, a reachability-based framework that brings control-theoretic safety tools to LLM inference. BRT-Align models autoregressive generation as a dynamical system in latent space and learn a safety value function via backward reachability, estimating the worst-case evolution of a trajectory. This enables two complementary mechanisms: (1) a runtime monitor that forecasts unsafe completions several tokens in advance, and (2) a least-restrictive steering filter that minimally perturbs latent states to redirect generation away from unsafe regions. Experiments across multiple LLMs and toxicity benchmarks demonstrate that BRT-Align provides more accurate and earlier detection of unsafe continuations than baselines. Moreover, for LLM safety alignment, BRT-Align substantially reduces unsafe generations while preserving sentence diversity and coherence. Qualitative results further highlight emergent alignment properties: BRT-Align consistently produces responses that are less violent, less profane, less offensive, and less politically biased. Together, these findings demonstrate that reachability analysis provides a principled and practical foundation for inference-time LLM safety.
Related papers
- BarrierSteer: LLM Safety via Learning Barrier Steering [83.12893815611052]
BarrierSteer is a novel framework that formalizes safety by embedding learned non-linear safety constraints directly into the model's latent representation space.<n>We show that BarrierSteer substantially reduces adversarial success rates, decreases unsafe generations, and outperforms existing methods.
arXiv Detail & Related papers (2026-02-23T18:19:46Z) - Understanding and Preserving Safety in Fine-Tuned LLMs [20.821783178639063]
Fine-tuning can substantially degrade safety alignment, even when the fine-tuning data is harmless.<n>We propose safety-preserving fine-tuning (SPF), a lightweight approach that explicitly removes gradient components conflicting with the low-rank safety subspace.<n> SPF consistently maintains downstream task performance and recovers nearly all pre-trained safety alignment, even under adversarial fine-tuning scenarios.
arXiv Detail & Related papers (2026-01-15T07:33:13Z) - RoboSafe: Safeguarding Embodied Agents via Executable Safety Logic [56.38397499463889]
Embodied agents powered by vision-language models (VLMs) are increasingly capable of executing complex real-world tasks.<n>However, they remain vulnerable to hazardous instructions that may trigger unsafe behaviors.<n>We propose RoboSafe, a runtime safeguard for embodied agents through executable predicate-based safety logic.
arXiv Detail & Related papers (2025-12-24T15:01:26Z) - Breaking the Safety-Capability Tradeoff: Reinforcement Learning with Verifiable Rewards Maintains Safety Guardrails in LLMs [3.198812241868092]
reinforcement learning with verifiable rewards (RLVR) has emerged as a promising alternative that optimize models on objectively measurable tasks.<n>We present the first comprehensive theoretical and empirical analysis of safety properties in RLVR.<n> Empirically, we conduct extensive experiments across five adversarial safety benchmarks, demonstrating that RLVR can simultaneously enhance reasoning capabilities while maintaining or improving safety guardrails.
arXiv Detail & Related papers (2025-11-26T04:36:34Z) - UpSafe$^\circ$C: Upcycling for Controllable Safety in Large Language Models [67.91151588917396]
Large Language Models (LLMs) have achieved remarkable progress across a wide range of tasks, but remain vulnerable to safety risks such as harmful content generation and jailbreak attacks.<n>We propose UpSafe$circ$C, a unified framework for enhancing LLM safety through safety-aware upcycling.<n>Our results highlight a new direction for LLM safety: moving from static alignment toward dynamic, modular, and inference-aware control.
arXiv Detail & Related papers (2025-10-02T16:43:33Z) - Anchoring Refusal Direction: Mitigating Safety Risks in Tuning via Projection Constraint [52.878820730054365]
Instruction Fine-Tuning (IFT) has been widely adopted as an effective post-training strategy to enhance abilities of Large Language Models (LLMs)<n>Recent research into the internal mechanisms of LLMs has identified the refusal direction (r-direction) in the hidden states, which plays a pivotal role in governing refusal behavior.<n>To mitigate such drift, our proposed ProCon method introduces a projection-constrained loss term that regularizes the projection magnitude of each training sample's hidden state onto the r-direction.
arXiv Detail & Related papers (2025-09-08T15:24:33Z) - SafeLLM: Unlearning Harmful Outputs from Large Language Models against Jailbreak Attacks [29.963044242980345]
Jailbreak attacks pose a serious threat to the safety of Large Language Models.<n>We propose SafeLLM, a novel unlearning-based defense framework.<n>We show that SafeLLM substantially reduces attack success rates while maintaining high general-purpose performance.
arXiv Detail & Related papers (2025-08-21T02:39:14Z) - Viability of Future Actions: Robust Safety in Reinforcement Learning via Entropy Regularization [47.30677525394649]
We analyze the interplay between two well-established techniques in model-free reinforcement learning: entropy regularization and constraints penalization.<n>We show that entropy regularization in constrained RL inherently biases learning toward maximizing the number of future viable actions, thereby promoting constraints satisfaction robust to action noise.<n>We conclude that the connection between entropy regularization and robustness is a promising avenue for further empirical and theoretical investigation.
arXiv Detail & Related papers (2025-06-12T16:34:19Z) - Improving LLM Safety Alignment with Dual-Objective Optimization [65.41451412400609]
Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks.<n>We propose an improved safety alignment that disentangles DPO objectives into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge.
arXiv Detail & Related papers (2025-03-05T18:01:05Z) - On Almost Surely Safe Alignment of Large Language Models at Inference-Time [20.5164976103514]
We introduce a novel inference-time alignment approach for LLMs that aims to generate safe responses almost surely.<n>We augment a safety state that tracks the evolution of safety constraints and dynamically penalizes unsafe generations.<n>We demonstrate formal safety guarantees w.r.t. the given cost model upon solving the MDP in the latent space with sufficiently large penalties.
arXiv Detail & Related papers (2025-02-03T09:59:32Z) - SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering [56.92068213969036]
Safety alignment is indispensable for Large Language Models (LLMs) to defend threats from malicious instructions.<n>Recent researches reveal safety-aligned LLMs prone to reject benign queries due to the exaggerated safety issue.<n>We propose a Safety-Conscious Activation Steering (SCANS) method to mitigate the exaggerated safety concerns.
arXiv Detail & Related papers (2024-08-21T10:01:34Z) - Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position.<n>DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful response sequence.
arXiv Detail & Related papers (2024-07-12T09:36:33Z) - State-Wise Safe Reinforcement Learning With Pixel Observations [12.338614299403305]
We propose a novel pixel-observation safe RL algorithm that efficiently encodes state-wise safety constraints with unknown hazard regions.
As a joint learning framework, our approach begins by constructing a latent dynamics model with low-dimensional latent spaces derived from pixel observations.
We then build and learn a latent barrier-like function on top of the latent dynamics and conduct policy optimization simultaneously, thereby improving both safety and the total expected return.
arXiv Detail & Related papers (2023-11-03T20:32:30Z) - Lyapunov-based uncertainty-aware safe reinforcement learning [0.0]
InReinforcement learning (RL) has shown a promising performance in learning optimal policies for a variety of sequential decision-making tasks.
In many real-world RL problems, besides optimizing the main objectives, the agent is expected to satisfy a certain level of safety.
We propose a Lyapunov-based uncertainty-aware safe RL model to address these limitations.
arXiv Detail & Related papers (2021-07-29T13:08:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.