Related papers: Nothing in Excess: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering

Nothing in Excess: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering

URL: http://arxiv.org/abs/2408.11491v1
Date: Wed, 21 Aug 2024 10:01:34 GMT
Title: Nothing in Excess: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering
Authors: Zouying Cao, Yifei Yang, Hai Zhao,
Abstract summary: Safety alignment is indispensable for Large language models (LLMs) to defend threats from malicious instructions. Recent researches reveal safety-aligned LLMs prone to reject benign queries due to the exaggerated safety issue. We propose a Safety-Conscious Activation Steering (SCANS) method to mitigate the exaggerated safety concerns.
Score: 56.92068213969036
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Safety alignment is indispensable for Large language models (LLMs) to defend threats from malicious instructions. However, recent researches reveal safety-aligned LLMs prone to reject benign queries due to the exaggerated safety issue, limiting their helpfulness. In this paper, we propose a Safety-Conscious Activation Steering (SCANS) method to mitigate the exaggerated safety concerns in aligned LLMs. First, SCANS extracts the refusal steering vectors within the activation space and utilizes vocabulary projection to anchor some specific safety-critical layers which influence model refusal behavior. Second, by tracking the hidden state transition, SCANS identifies the steering direction and steers the model behavior accordingly, achieving a balance between exaggerated safety and adequate safety. Experiments show that SCANS achieves new state-of-the-art performance on XSTest and OKTest benchmarks, without impairing their defense capability against harmful queries and maintaining almost unchanged model capability.

Related papers

EASE: Practical and Efficient Safety Alignment for Small Language Models [4.839980912290382]
Small language models (SLMs) are increasingly deployed on edge devices, making their safety alignment crucial yet challenging.<n>We propose EASE, a novel framework that enables practical and Efficient safety alignment for Small languagE models.
arXiv Detail & Related papers (2025-11-09T19:46:54Z)
UpSafe$^\circ$C: Upcycling for Controllable Safety in Large Language Models [67.91151588917396]
Large Language Models (LLMs) have achieved remarkable progress across a wide range of tasks, but remain vulnerable to safety risks such as harmful content generation and jailbreak attacks.<n>We propose UpSafe$circ$C, a unified framework for enhancing LLM safety through safety-aware upcycling.<n>Our results highlight a new direction for LLM safety: moving from static alignment toward dynamic, modular, and inference-aware control.
arXiv Detail & Related papers (2025-10-02T16:43:33Z)
Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework [31.278770676774325]
We propose Safe-SAIL, a framework for interpreting SAE features within large language models (LLMs)<n>Our approach systematically identifies SAE with best concept-specific interpretability, explains safety-related neurons, and introduces efficient strategies to scale up the interpretation process.
arXiv Detail & Related papers (2025-09-11T11:22:43Z)
Should LLM Safety Be More Than Refusing Harmful Instructions? [6.5137518437747]
This paper presents a systematic evaluation of Large Language Models' (LLMs) behavior on long-tail distributed (encrypted) texts.<n>We introduce a two-dimensional framework for assessing LLM safety.<n>We demonstrate that models that possess capabilities to decrypt ciphers may be susceptible to mismatched-generalization attacks.
arXiv Detail & Related papers (2025-06-03T05:00:12Z)
SafeSteer: Interpretable Safety Steering with Refusal-Evasion in LLMs [7.120986296945107]
This paper investigates an approach called SafeSteer for guiding the outputs of large language models (LLMs)<n>We employ a simple, gradient-free unsupervised method to enhance safety steering while preserving text quality, topic relevance, and without explicit refusal.
arXiv Detail & Related papers (2025-06-01T01:19:37Z)
Learning Safety Constraints for Large Language Models [41.95596134688853]
Large language models (LLMs) pose significant safety risks through harmful outputs and vulnerability to adversarial attacks.<n>We propose SaP, a geometric approach to safety that learns and enforces multiple safety constraints directly in the model's representation space.<n>We develop a framework that identifies safe and unsafe regions via the polytope's facets, enabling both detection and correction of unsafe outputs.
arXiv Detail & Related papers (2025-05-30T10:30:24Z)
Shape it Up! Restoring LLM Safety during Finetuning [66.46166656543761]
Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks.<n>We propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content.<n>We present STAR-DSS, guided by STAR scores, that robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families.
arXiv Detail & Related papers (2025-05-22T18:05:16Z)
STAIR: Improving Safety Alignment with Introspective Reasoning [44.780098674618614]
We propose STAIR, a framework that integrates SafeTy Alignment with Itrospective Reasoning. We show that STAIR effectively mitigates harmful outputs while better preserving helpfulness, compared to instinctive alignment strategies. With test-time scaling, STAIR achieves a safety performance comparable to Claude-3.5 against popular jailbreak attacks.
arXiv Detail & Related papers (2025-02-04T15:02:55Z)
Internal Activation as the Polar Star for Steering Unsafe LLM Behavior [50.463399903987245]
We introduce SafeSwitch, a framework that dynamically regulates unsafe outputs by monitoring and utilizing the model's internal states. Our empirical results show that SafeSwitch reduces harmful outputs by over 80% on safety benchmarks while maintaining strong utility.
arXiv Detail & Related papers (2025-02-03T04:23:33Z)
Superficial Safety Alignment Hypothesis [8.297367440457508]
We propose the Superficial Safety Alignment Hypothesis (SSAH), which posits that safety alignment should teach an otherwise unsafe model to choose the correct reasoning direction. We identify four types of attribute-critical components in safety-aligned large language models (LLMs) Our findings show that freezing certain safety-critical components 7.5% during fine-tuning allows the model to retain its safety attributes while adapting to new tasks.
arXiv Detail & Related papers (2024-10-07T19:53:35Z)
Safety Layers in Aligned Large Language Models: The Key to LLM Security [43.805905164456846]
Internal parameters can be vulnerable to security degradation when fine-tuned with non-malicious backdoor or normal data. We identify a small set of contiguous layers in the middle of the model that are crucial for distinguishing malicious queries from normal ones. We propose a novel fine-tuning approach, Safely Partial Fine-Tuning (SPPFT), that fixes the gradient of the safety layers during fine-tuning to address the security degradation.
arXiv Detail & Related papers (2024-08-30T04:35:59Z)
Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs) We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position. DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful
arXiv Detail & Related papers (2024-07-12T09:36:33Z)
Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations [19.132597762214722]
Current alignment methods struggle with dynamic user intentions and complex objectives. We propose Safety Arithmetic, a training-free framework enhancing safety across different scenarios. Our experiments show that Safety Arithmetic significantly improves safety measures, reduces over-safety, and maintains model utility.
arXiv Detail & Related papers (2024-06-17T17:48:13Z)
Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models [65.06446825020578]
Safety alignment is crucial to ensure that large language models (LLMs) behave in ways that align with human preferences and prevent harmful actions during inference. We aim to measure the risks in finetuning LLMs through navigating the LLM safety landscape.
arXiv Detail & Related papers (2024-05-27T17:31:56Z)
Towards Comprehensive and Efficient Post Safety Alignment of Large Language Models via Safety Patching [77.36097118561057]
textscSafePatching is a novel framework for comprehensive and efficient PSA. textscSafePatching achieves a more comprehensive and efficient PSA than baseline methods.
arXiv Detail & Related papers (2024-05-22T16:51:07Z)
On Prompt-Driven Safeguarding for Large Language Models [172.13943777203377]
We find that in the representation space, the input queries are typically moved by safety prompts in a "higher-refusal" direction. Inspired by these findings, we propose a method for safety prompt optimization, namely DRO. Treating a safety prompt as continuous, trainable embeddings, DRO learns to move the queries' representations along or opposite the refusal direction, depending on their harmfulness.
arXiv Detail & Related papers (2024-01-31T17:28:24Z)
The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness [56.174255970895466]
Large Language Models (LLMs) play an increasingly pivotal role in natural language processing applications. This paper presents Safety and Over-Defensiveness Evaluation (SODE) benchmark.
arXiv Detail & Related papers (2023-12-30T17:37:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.