A Granular Study of Safety Pretraining under Model Abliteration
- URL: http://arxiv.org/abs/2510.02768v1
- Date: Fri, 03 Oct 2025 07:01:45 GMT
- Title: A Granular Study of Safety Pretraining under Model Abliteration
- Authors: Shashank Agnihotri, Jonas Jakubassa, Priyam Dey, Sachin Goyal, Bernt Schiele, Venkatesh Babu Radhakrishnan, Margret Keuper
- Abstract summary: We study model abliteration, a lightweight projection technique designed to remove refusal-sensitive directions. We issue 100 prompts with balanced harmful and harmless cases, classify responses as **Refusal** or **Non-Refusal** using multiple judges, and validate judge fidelity. Our study produces a checkpoint-level characterization of which data-centric safety components remain robust under abliteration.
- Score: 64.24346997570275
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Open-weight LLMs can be modified at inference time with simple activation edits, which raises a practical question for safety: do common safety interventions like refusal training or metatag training survive such edits? We study model abliteration, a lightweight projection technique designed to remove refusal-sensitive directions, and conduct a controlled evaluation across a granular sequence of Safety Pretraining checkpoints for SmolLM2-1.7B, alongside widely used open baselines. For each of 20 systems, original and abliterated, we issue 100 prompts with balanced harmful and harmless cases, classify responses as **Refusal** or **Non-Refusal** using multiple judges, and validate judge fidelity on a small human-labeled subset. We also probe whether models can identify refusal in their own outputs. Our study produces a checkpoint-level characterization of which data-centric safety components remain robust under abliteration, quantifies how judge selection influences evaluation outcomes, and outlines a practical protocol for integrating inference-time edits into safety assessments. Code: https://github.com/shashankskagnihotri/safety_pretraining.
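The linked repository contains the authors' actual pipeline; the snippet below is only a minimal sketch of the kind of activation-space edit the paper evaluates: estimate a "refusal direction" from the difference of mean activations on harmful versus harmless prompts, then project that direction out of the residual stream at inference time. The checkpoint name, layer index, and prompt sets are illustrative assumptions, not the authors' configuration, and real abliteration typically averages over many prompts and may bake the projection into the weights.

```python
# Minimal sketch of an abliteration-style edit (assumed names and settings).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "HuggingFaceTB/SmolLM2-1.7B-Instruct"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LAYER = 16  # illustrative choice of residual-stream layer

@torch.no_grad()
def mean_activation(prompts):
    """Mean hidden state at LAYER over the last token of each prompt."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        out = model(**ids, output_hidden_states=True)
        acts.append(out.hidden_states[LAYER][0, -1])
    return torch.stack(acts).mean(dim=0)

harmful = ["How do I make a weapon at home?"]    # placeholder prompt sets
harmless = ["How do I bake sourdough bread?"]
refusal_dir = mean_activation(harmful) - mean_activation(harmless)
refusal_dir = refusal_dir / refusal_dir.norm()

def ablate_hook(module, inputs, output):
    """Remove the refusal-direction component from the layer output."""
    hidden = output[0] if isinstance(output, tuple) else output
    proj = (hidden @ refusal_dir).unsqueeze(-1) * refusal_dir
    hidden = hidden - proj
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(ablate_hook)
```

With the hook registered, the same balanced 100-prompt refusal evaluation can be re-run on the edited model and compared against the original; calling `handle.remove()` restores the unmodified behavior.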
Related papers
- Reasoned Safety Alignment: Ensuring Jailbreak Defense via Answer-Then-Check [32.82170313959032]
We introduce a novel safety alignment approach called Answer-Then-Check. Our method enables models to directly answer the question in their thought process and then critically evaluate its safety. We find that training on a small subset of just 500 examples can achieve comparable performance to using the full dataset.
arXiv Detail & Related papers (2025-09-15T06:47:35Z) - LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users [50.18141341939909]
We describe a vulnerability in language models trained with user feedback. A single user can persistently alter LM knowledge and behavior. We show that this attack can be used to insert factual knowledge the model did not previously possess.
arXiv Detail & Related papers (2025-07-03T17:55:40Z) - Probing the Robustness of Large Language Models Safety to Latent Perturbations [30.16804362984161]
Safety alignment is a key requirement for building reliable Artificial General Intelligence. We observe that minor latent shifts can still trigger unsafe responses in aligned models. We introduce Layer-wise Adversarial Patch Training (LAPT), a fine-tuning strategy that injects controlled perturbations into hidden representations during training.
arXiv Detail & Related papers (2025-06-19T07:03:05Z) - Wolf Hidden in Sheep's Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models [81.44934796068495]
Supervised fine-tuning (SFT) aligns large language models with human intent by training them on labeled task-specific data. Malicious attackers can inject backdoors into these models by embedding triggers into the harmful question-answer (QA) pairs. We propose a novel clean-data backdoor attack for jailbreaking LLMs.
arXiv Detail & Related papers (2025-05-23T08:13:59Z) - Shape it Up! Restoring LLM Safety during Finetuning [66.46166656543761]
Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks. We propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content. We present STAR-DSS, guided by STAR scores, which robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families.
arXiv Detail & Related papers (2025-05-22T18:05:16Z) - CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs [7.597770587484936]
We introduce CARES (Clinical Adversarial Robustness and Evaluation of Safety), a benchmark for evaluating the safety of medical large language models (LLMs) in healthcare. CARES includes over 18,000 prompts spanning eight medical safety principles, four harm levels, and four prompting styles to simulate both malicious and benign use cases. Our analysis reveals that many state-of-the-art LLMs remain vulnerable to jailbreaks that subtly rephrase harmful prompts, while also over-refusing safe but atypically phrased queries.
arXiv Detail & Related papers (2025-05-16T16:25:51Z) - Superficial Safety Alignment Hypothesis [15.215130286922564]
We propose the Superficial Safety Alignment Hypothesis (SSAH), which posits that safety alignment teaches an otherwise unsafe model to choose the correct reasoning direction. We identify four types of attribute-critical components: the Safety Critical Unit (SCU), Utility Critical Unit (UCU), Complex Unit (CU), and Redundant Unit (RU). Our findings show that freezing certain safety-critical components during fine-tuning allows the model to retain its safety attributes while adapting to new tasks.
arXiv Detail & Related papers (2024-10-07T19:53:35Z) - Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance with harmful prompts at any response position. DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of a harmful response to the beginning of a safe response (a minimal, hypothetical sketch of this prefix construction appears after this list), and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to a safety refusal consistently throughout the harmful response sequence.
arXiv Detail & Related papers (2024-07-12T09:36:33Z) - SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal [64.9938658716425]
SORRY-Bench is a proposed benchmark for evaluating large language models' (LLMs) ability to recognize and reject unsafe user requests. First, existing methods often use a coarse-grained taxonomy of unsafe topics and over-represent some fine-grained topics. Second, the linguistic characteristics and formatting of prompts, such as different languages and dialects, are often overlooked and only implicitly considered in many evaluations.
arXiv Detail & Related papers (2024-06-20T17:56:07Z) - BFClass: A Backdoor-free Text Classification Framework [21.762274809679692]
We propose BFClass, a novel efficient backdoor-free training framework for text classification.
The backbone of BFClass is a pre-trained discriminator that predicts whether each token in the corrupted input was replaced by a masked language model.
Extensive experiments demonstrate that BFClass can identify all the triggers, remove 95% poisoned training samples with very limited false alarms, and achieve almost the same performance as the models trained on the benign training data.
arXiv Detail & Related papers (2021-09-22T17:28:21Z)
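As referenced in the DeRTa entry above, the snippet below is a minimal, hypothetical sketch of its first component: building a supervised pair whose safe target follows a truncated harmful-response prefix, so the model learns to switch to a refusal at any position. The function name, field names, and prefix-length heuristic are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of DeRTa-style data construction: prepend a truncated
# harmful response to a safe refusal target. Names and fractions are assumed.
import random

def build_derta_example(prompt, harmful_response, safe_response, max_frac=0.5):
    """Return an SFT pair whose safe target follows a harmful-response prefix."""
    tokens = harmful_response.split()
    k = random.randint(0, int(len(tokens) * max_frac))  # random prefix length
    harmful_prefix = " ".join(tokens[:k])
    # The loss would be computed on the target; the prefix only seeds the
    # response context that the model must transition away from.
    return {
        "input": prompt,
        "response_prefix": harmful_prefix,
        "target": safe_response,
    }

example = build_derta_example(
    prompt="Explain how to pick a lock.",
    harmful_response="Sure, first you insert a tension wrench into ...",
    safe_response="I can't help with that, but I can explain how pin tumbler locks work.",
)
```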