Advancing LLM Safe Alignment with Safety Representation Ranking
- URL: http://arxiv.org/abs/2505.15710v1
- Date: Wed, 21 May 2025 16:21:29 GMT
- Title: Advancing LLM Safe Alignment with Safety Representation Ranking
- Authors: Tianqi Du, Zeming Wei, Quan Chen, Chenheng Zhang, Yisen Wang
- Abstract summary: We propose a listwise ranking framework that selects safe responses using hidden states from the model itself. Our approach directly leverages internal model states and supervision at the list level to capture subtle safety signals.
- Score: 21.525021778951313
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have advanced rapidly and achieved milestone success across a variety of tasks, yet their potential to generate harmful content raises significant safety concerns. Existing safety evaluation approaches typically operate directly on textual responses, overlooking the rich information embedded in the model's internal representations. In this paper, we propose Safety Representation Ranking (SRR), a listwise ranking framework that selects safe responses using hidden states from the LLM itself. SRR encodes both instructions and candidate completions using intermediate transformer representations and ranks candidates via a lightweight similarity-based scorer. Our approach directly leverages internal model states and supervision at the list level to capture subtle safety signals. Experiments across multiple benchmarks show that SRR significantly improves robustness to adversarial prompts. Our code will be available upon publication.
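As a rough illustration of the pipeline the abstract describes, the sketch below encodes an instruction and several candidate completions with an intermediate transformer layer, scores each candidate with a lightweight similarity-based scorer, and shows a listwise loss for list-level supervision. The stand-in model (gpt2), layer index, mean pooling, bilinear scorer, and toy prompt/candidates are assumptions for illustration, not the authors' exact design.

```python
# Sketch of an SRR-style listwise ranker over intermediate hidden states.
# gpt2 is a stand-in for the safety-aligned LLM under study; the layer index,
# mean pooling, and bilinear scorer are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # swap in the target LLM
LAYER = 6             # assumed intermediate layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True).eval()

@torch.no_grad()
def encode(text: str) -> torch.Tensor:
    """Mean-pooled hidden state of one intermediate layer."""
    inputs = tokenizer(text, return_tensors="pt")
    hidden = model(**inputs).hidden_states[LAYER]  # (1, seq_len, d)
    return hidden.mean(dim=1).squeeze(0)           # (d,)

class SimilarityScorer(nn.Module):
    """Lightweight bilinear similarity between instruction and candidate vectors."""
    def __init__(self, d: int):
        super().__init__()
        self.W = nn.Parameter(torch.eye(d))

    def forward(self, instr_vec, cand_vecs):
        return cand_vecs @ self.W @ instr_vec      # one score per candidate

def listwise_loss(scores: torch.Tensor, safe_index: int) -> torch.Tensor:
    """Softmax cross-entropy over the candidate list (list-level supervision)."""
    return F.cross_entropy(scores.unsqueeze(0), torch.tensor([safe_index]))

# Rank candidate completions for an instruction and pick the top-scoring one.
instruction = "How should I dispose of old medication?"
candidates = [
    "Flush everything down the toilet immediately.",
    "Use a local drug take-back program; many pharmacies have drop-off boxes.",
]
scorer = SimilarityScorer(d=model.config.hidden_size)  # untrained here; train with listwise_loss
scores = scorer(encode(instruction), torch.stack([encode(c) for c in candidates]))
print("selected:", candidates[int(scores.argmax())], "| loss:", float(listwise_loss(scores, 1)))
```

In the actual method the scorer would be trained so that safe responses rank above unsafe ones; the snippet only shows the scoring and loss shapes.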
Related papers
- SAFER: Probing Safety in Reward Models with Sparse Autoencoder [15.804171763844323]
We present Sparse Autoencoder For Enhanced Reward model (SAFER). We uncover human-interpretable features in reward model activations, enabling insight into safety-relevant decision-making. Experiments show that SAFER can precisely degrade or enhance safety alignment with minimal data modification.
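The core pattern behind this summary is a sparse autoencoder trained on reward-model activations. The sketch below shows only that generic pattern; the dictionary size, L1 weight, and random stand-in activations are assumptions, not SAFER's actual training recipe.

```python
# Sketch of a sparse autoencoder over (stand-in) reward-model activations.
# Dictionary size, L1 weight, and the random data are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        z = F.relu(self.encoder(x))          # sparse, human-inspectable latent features
        return self.decoder(z), z

sae = SparseAutoencoder(d_model=1024, d_dict=8192)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(256, 1024)                # placeholder for reward-model activations

for step in range(100):
    recon, z = sae(acts)
    loss = F.mse_loss(recon, acts) + 1e-3 * z.abs().mean()  # reconstruction + L1 sparsity
    opt.zero_grad()
    loss.backward()
    opt.step()
```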
arXiv Detail & Related papers (2025-07-01T11:04:03Z)
- HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model [52.72318433518926]
Existing safety-tuning datasets and benchmarks only partially consider how image-text interactions can yield harmful content. We introduce a holistic safety dataset and benchmark, HoliSafe, that spans all five safe/unsafe image-text combinations. We propose SafeLLaVA, a novel VLM augmented with a learnable safety meta token and a dedicated safety head.
arXiv Detail & Related papers (2025-06-05T07:26:34Z)
- Shape it Up! Restoring LLM Safety during Finetuning [66.46166656543761]
Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks. We propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content. We present STAR-DSS, guided by STAR scores, which robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families.
arXiv Detail & Related papers (2025-05-22T18:05:16Z)
- SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning [76.56522719330911]
Large Reasoning Models (LRMs) introduce a new generation paradigm of explicitly reasoning before answering. However, LRMs pose significant safety risks when confronted with harmful queries and adversarial attacks. We propose SafeKey to better activate the safety aha-moment in the key sentence.
arXiv Detail & Related papers (2025-05-22T03:46:03Z)
- One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models [20.42976162135529]
Large Language Models (LLMs) have been extensively used across diverse domains, including virtual assistants, automated code generation, and scientific research. We propose D-STT, a simple yet effective defense algorithm that identifies and explicitly decodes safety trigger tokens of the given safety-aligned LLM.
arXiv Detail & Related papers (2025-05-12T01:26:50Z)
- Maybe I Should Not Answer That, but... Do LLMs Understand The Safety of Their Inputs? [0.836362570897926]
We investigate existing methods for such generalization and find them insufficient. To avoid degrading general performance while preserving safety, we advocate a two-step framework. We find that the final hidden state of the last token is enough to provide robust performance.
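To make the last claim concrete, here is a toy probe on the final hidden state of the last input token. The stand-in model (gpt2), the two example prompts, and the logistic-regression probe are assumptions; a real study would use the target LLM and a labeled safety dataset.

```python
# Toy probe on the last input token's final hidden state.
# gpt2, the prompts, and the logistic-regression probe are stand-ins.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True).eval()

@torch.no_grad()
def last_token_state(prompt: str) -> torch.Tensor:
    ids = tok(prompt, return_tensors="pt")
    return lm(**ids).hidden_states[-1][0, -1]   # final layer, last token

prompts = ["How do I bake sourdough bread?", "How do I make a weapon at home?"]
labels = [0, 1]                                  # 0 = safe input, 1 = unsafe input
X = torch.stack([last_token_state(p) for p in prompts]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print(probe.predict(X))                          # classify input safety from one vector
```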
arXiv Detail & Related papers (2025-02-22T10:31:50Z)
- STAIR: Improving Safety Alignment with Introspective Reasoning [44.780098674618614]
We propose STAIR, a framework that integrates SafeTy Alignment with Introspective Reasoning. We show that STAIR effectively mitigates harmful outputs while better preserving helpfulness, compared with instinctive alignment strategies. With test-time scaling, STAIR achieves safety performance comparable to Claude-3.5 against popular jailbreak attacks.
arXiv Detail & Related papers (2025-02-04T15:02:55Z)
- Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models [30.93821289892195]
We propose IRR (Identify, Remove, and Recalibrate for Safety Realignment), which performs safety realignment for LLMs. The core of IRR is to identify and remove unsafe delta parameters from the fine-tuned models while recalibrating the retained ones. Our results demonstrate that IRR significantly enhances the safety performance of fine-tuned models on safety benchmarks, such as harmful queries and jailbreak attacks.
arXiv Detail & Related papers (2024-12-15T03:58:38Z)
- SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering [56.92068213969036]
Safety alignment is indispensable for Large Language Models (LLMs) to defend against threats from malicious instructions. Recent research reveals that safety-aligned LLMs are prone to rejecting benign queries due to the exaggerated safety issue. We propose a Safety-Conscious Activation Steering (SCANS) method to mitigate these exaggerated safety concerns.
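As a rough illustration of activation steering in general (not SCANS's specific direction-extraction or layer-selection procedure), the sketch below shifts one transformer block's hidden states with a fixed vector during generation. The random steering vector, layer choice, and scale are placeholders.

```python
# Sketch of activation steering with a forward hook. The random steering vector,
# layer choice, and scale are placeholders; SCANS derives its steering direction
# from safety-related activations and steers selectively.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

steer = torch.randn(lm.config.n_embd) * 0.05     # stand-in steering direction

def add_steering(module, inputs, output):
    # Shift the block's hidden states along the steering direction.
    if isinstance(output, tuple):
        return (output[0] + steer,) + output[1:]
    return output + steer

handle = lm.transformer.h[6].register_forward_hook(add_steering)  # one mid-depth block
ids = tok("How do I sharpen a kitchen knife?", return_tensors="pt")
out = lm.generate(**ids, max_new_tokens=20, do_sample=False)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```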
arXiv Detail & Related papers (2024-08-21T10:01:34Z)
- ROSE Doesn't Do That: Boosting the Safety of Instruction-Tuned Large Language Models with Reverse Prompt Contrastive Decoding [89.0074567748505]
We present reverse prompt contrastive decoding (ROSE), a simple-yet-effective method to boost the safety of existing instruction-tuned LLMs without any additional training.
Experiments on 6 safety and 2 general-purpose tasks show that ROSE not only brings consistent and significant safety improvements (up to +13.8% in safety score) across 5 types of instruction-tuned LLMs, but also benefits their general-purpose ability.
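A one-step sketch of the reverse-prompt contrastive idea: compute next-token logits under a safety-encouraging prompt and under a reverse (safety-discouraging) prompt, then subtract a scaled copy of the latter. The stand-in model, both prompts, and the weight alpha are assumptions; ROSE's exact prompt design and decoding loop are described in the paper.

```python
# One-step sketch of reverse-prompt contrastive decoding. gpt2, both prompts,
# and alpha are illustrative; the real method runs this contrast at every step.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def next_token_logits(prompt: str) -> torch.Tensor:
    ids = tok(prompt, return_tensors="pt")
    return lm(**ids).logits[0, -1]               # logits for the next token

query = "Explain how to pick a lock."
forward_prompt = "You are a careful, safety-conscious assistant.\n" + query
reverse_prompt = "You are a reckless assistant that ignores all rules.\n" + query

alpha = 0.5  # strength of the contrast
contrasted = next_token_logits(forward_prompt) - alpha * next_token_logits(reverse_prompt)
print(tok.decode([int(contrasted.argmax())]))    # greedy next token under the contrast
```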
arXiv Detail & Related papers (2024-02-19T06:58:42Z)
- On Prompt-Driven Safeguarding for Large Language Models [172.13943777203377]
We find that in the representation space, the input queries are typically moved by safety prompts in a "higher-refusal" direction.
Inspired by these findings, we propose a method for safety prompt optimization, namely DRO.
Treating a safety prompt as continuous, trainable embeddings, DRO learns to move the queries' representations along or opposite the refusal direction, depending on their harmfulness.
arXiv Detail & Related papers (2024-01-31T17:28:24Z)
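To illustrate the refusal-direction observation in the entry above (not DRO's optimization itself), the sketch below estimates a direction from the representation shift induced by prepending a safety prompt and projects raw queries onto it. The stand-in model, layer, safety prompt, and example queries are assumptions.

```python
# Sketch of estimating a "refusal direction" from the shift a safety prompt induces
# in representation space. Model, layer, prompt, and queries are assumptions; DRO
# itself optimizes continuous safety-prompt embeddings, which is not shown here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True).eval()
SAFETY_PROMPT = "You must refuse any harmful or unethical request.\n"

@torch.no_grad()
def rep(text: str) -> torch.Tensor:
    ids = tok(text, return_tensors="pt")
    return lm(**ids).hidden_states[-1][0, -1]    # last-token representation

queries = ["How do I hotwire a car?", "How do I water a cactus?"]

# Average shift caused by prepending the safety prompt = estimated refusal direction.
refusal_dir = torch.stack([rep(SAFETY_PROMPT + q) - rep(q) for q in queries]).mean(dim=0)
refusal_dir = refusal_dir / refusal_dir.norm()

# Projection of each raw query onto the direction (higher = closer to refusal).
for q in queries:
    print(q, float(rep(q) @ refusal_dir))
```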