Related papers: Matching Ranks Over Probability Yields Truly Deep Safety Alignment

Matching Ranks Over Probability Yields Truly Deep Safety Alignment

URL: http://arxiv.org/abs/2512.05518v1
Date: Fri, 05 Dec 2025 08:22:27 GMT
Title: Matching Ranks Over Probability Yields Truly Deep Safety Alignment
Authors: Jason Vega, Gagandeep Singh,
Abstract summary: Recent work proposed a supervised fine-tuning (SFT) defense using data augmentation to achieve a enquotedeep safety alignment.<n>We show in this work that the "deep" safety alignment produced by such an approach is in fact not very deep.<n>We propose a new perspective on achieving deep safety alignment by matching the token ranks of the target distribution, rather than their probabilities.
Score: 8.692532301315135
License: http://creativecommons.org/licenses/by/4.0/
Abstract: A frustratingly easy technique known as the prefilling attack has been shown to effectively circumvent the safety alignment of frontier LLMs by simply prefilling the assistant response with an affirmative prefix before decoding. In response, recent work proposed a supervised fine-tuning (SFT) defense using data augmentation to achieve a \enquote{deep} safety alignment, allowing the model to generate natural language refusals immediately following harmful prefills. Unfortunately, we show in this work that the "deep" safety alignment produced by such an approach is in fact not very deep. A generalization of the prefilling attack, which we refer to as the Rank-Assisted Prefilling (RAP) attack, can effectively extract harmful content from models fine-tuned with the data augmentation defense by selecting low-probability "harmful" tokens from the top 20 predicted next tokens at each step (thus ignoring high-probability "refusal" tokens). We argue that this vulnerability is enabled due to the "gaming" of the SFT objective when the target distribution entropies are low, where low fine-tuning loss is achieved by shifting large probability mass to a small number of refusal tokens while neglecting the high ranks of harmful tokens. We then propose a new perspective on achieving deep safety alignment by matching the token ranks of the target distribution, rather than their probabilities. This perspective yields a surprisingly simple fix to the data augmentation defense based on regularizing the attention placed on harmful prefill tokens, an approach we call PRefill attEntion STOpping (PRESTO). Adding PRESTO yields up to a 4.7x improvement in the mean StrongREJECT score under RAP attacks across three popular open-source LLMs, with low impact to model utility.

Related papers

Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment [13.463606100715504]
Large language models are vulnerable to attacks that disguise harmful intent.<n>This vulnerability stems from shallow alignment mechanisms that lack deep reasoning.<n>We propose enhancing alignment through reasoning-aware post-training.
arXiv Detail & Related papers (2026-02-24T20:30:51Z)
Defenses Against Prompt Attacks Learn Surface Heuristics [40.392588465939106]
Large language models (LLMs) are increasingly deployed in security-sensitive applications.<n>LLMs may override intended logic when adversarial instructions appear in user queries or retrieved content.<n>Recent defenses rely on supervised fine-tuning with benign and malicious labels.
arXiv Detail & Related papers (2026-01-12T04:12:48Z)
Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth [19.670368480802725]
We propose Any-Depth Alignment (ADA), an effective inference-time defense with negligible overhead.<n>ADA induces the model to reassess harmfulness and recover refusals at any point in generation.<n>It secures a near-100% refusal rate against challenging adversarial prefill attacks ranging from dozens to thousands of tokens.
arXiv Detail & Related papers (2025-10-20T20:18:59Z)
One Token Embedding Is Enough to Deadlock Your Large Reasoning Model [91.48868589442837]
We present the Deadlock Attack, a resource exhaustion method that hijacks an LRM's generative control flow.<n>Our method achieves a 100% attack success rate across four advanced LRMs.
arXiv Detail & Related papers (2025-10-12T07:42:57Z)
Revisiting Backdoor Attacks on LLMs: A Stealthy and Practical Poisoning Framework via Harmless Inputs [54.90315421117162]
We propose a novel poisoning method via completely harmless data.<n>Inspired by the causal reasoning in auto-regressive LLMs, we aim to establish robust associations between triggers and an affirmative response prefix.<n>We observe an interesting resistance phenomenon where the LLM initially appears to agree but subsequently refuses to answer.
arXiv Detail & Related papers (2025-05-23T08:13:59Z)
One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models [20.42976162135529]
Large Language Models (LLMs) have been extensively used across diverse domains, including virtual assistants, automated code generation, and scientific research.<n>We propose textttD-STT, a simple yet effective defense algorithm that identifies and explicitly decodes safety trigger tokens of the given safety-aligned LLM.
arXiv Detail & Related papers (2025-05-12T01:26:50Z)
REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and Semantic Objective [57.57786477441956]
We propose an adaptive and semantic optimization problem over the population of responses.<n>Our objective doubles the attack success rate (ASR) on Llama3 and increases the ASR from 2% to 50% with circuit breaker defense.
arXiv Detail & Related papers (2025-02-24T15:34:48Z)
Safety Alignment Should Be Made More Than Just a Few Tokens Deep [48.823599143711235]
The safety alignment of current Large Language Models (LLMs) is vulnerable. Relatively simple attacks, or even benign fine-tuning, can jailbreak aligned models. We show that deepening the safety alignment beyond just the first few tokens can often meaningfully improve robustness against some common exploits.
arXiv Detail & Related papers (2024-06-10T00:35:23Z)
Mind the Inconspicuous: Revealing the Hidden Weakness in Aligned LLMs' Refusal Boundaries [22.24239212756129]
We find that simply appending multiple end of sequence (eos) tokens can cause a phenomenon we call context segmentation.<n>We propose a straightforward method to BOOST jailbreak attacks by appending eos tokens.
arXiv Detail & Related papers (2024-05-31T07:41:03Z)
Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information [67.78183175605761]
Large Language Models are susceptible to adversarial prompt attacks. This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs. We introduce a novel approach to detecting adversarial prompts at a token level.
arXiv Detail & Related papers (2023-11-20T03:17:21Z)
A Self-supervised Approach for Adversarial Robustness [105.88250594033053]
Adversarial examples can cause catastrophic mistakes in Deep Neural Network (DNNs) based vision systems. This paper proposes a self-supervised adversarial training mechanism in the input space. It provides significant robustness against the textbfunseen adversarial attacks.
arXiv Detail & Related papers (2020-06-08T20:42:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.