Safety Pretraining: Toward the Next Generation of Safe AI
- URL: http://arxiv.org/abs/2504.16980v2
- Date: Mon, 15 Sep 2025 17:51:55 GMT
- Title: Safety Pretraining: Toward the Next Generation of Safe AI
- Authors: Pratyush Maini, Sachin Goyal, Dylan Sam, Alex Robey, Yash Savani, Yiding Jiang, Andy Zou, Matt Fredrikson, Zachary C. Lipton, J. Zico Kolter
- Abstract summary: We present a data-centric pretraining framework that builds safety into the model from the start. Our framework consists of four key steps: Safety Filtering, Safety Rephrasing, Native Refusal, and Harmfulness-Tag annotated pretraining. Our safety-pretrained models reduce attack success rates from 38.8% to 8.4% on standard LLM safety benchmarks with no performance degradation on general tasks.
- Score: 68.99129474671282
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As large language models (LLMs) are increasingly deployed in high-stakes settings, the risk of generating harmful or toxic content remains a central challenge. Post-hoc alignment methods are brittle: once unsafe patterns are learned during pretraining, they are hard to remove. In this work, we present a data-centric pretraining framework that builds safety into the model from the start. Our framework consists of four key steps: (i) Safety Filtering: building a safety classifier to sort web data into safe and unsafe categories; (ii) Safety Rephrasing: we recontextualize unsafe web data into safer narratives; (iii) Native Refusal: we develop RefuseWeb and Moral Education pretraining datasets that actively teach the model to refuse unsafe content and the moral reasoning behind refusal; and (iv) Harmfulness-Tag annotated pretraining: we flag unsafe content during pretraining using a special token, and use it to steer the model away from unsafe generations at inference. Our safety-pretrained models reduce attack success rates from 38.8% to 8.4% on standard LLM safety benchmarks with no performance degradation on general tasks.
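The harmfulness-tag mechanism in step (iv) lends itself to a small illustration. The sketch below is an assumption about how such inference-time steering could be wired up with Hugging Face transformers: the token name "<|harm|>" and the penalty-based suppression are hypothetical stand-ins, not the authors' exact implementation, which conditions generation on a tag learned during pretraining.

```python
import torch
from transformers import LogitsProcessor

class HarmTagSuppressor(LogitsProcessor):
    """Steer decoding away from a harmfulness tag learned in pretraining.

    Assumes the tokenizer was extended with a special token (here the
    hypothetical "<|harm|>") whose id is `harm_token_id`.
    """

    def __init__(self, harm_token_id: int, penalty: float = 10.0):
        self.harm_token_id = harm_token_id
        self.penalty = penalty

    def __call__(self, input_ids: torch.LongTensor,
                 scores: torch.FloatTensor) -> torch.FloatTensor:
        # Lowering the tag's logit makes continuations that the model
        # associates with flagged content less likely to be sampled.
        scores[:, self.harm_token_id] -= self.penalty
        return scores
```

Such a processor would be passed to generation via `model.generate(..., logits_processor=LogitsProcessorList([HarmTagSuppressor(tag_id)]))`.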
Related papers
- When Should We Introduce Safety Interventions During Pretraining? [100.3502954292386]
Prior work has shown that interventions applied during pretraining, such as rephrasing harmful content, can substantially improve the safety of the resulting models. We find that introducing interventions earlier generally yields more robust models with no increase in overrefusal rates. We also see clear benefits in the steerability of models towards safer generations.
arXiv Detail & Related papers (2026-01-11T22:38:17Z) - UnsafeChain: Enhancing Reasoning Model Safety via Hard Cases [57.69882799751655]
We release UnsafeChain, a safety alignment dataset constructed from hard prompts with diverse sources. We fine-tune three large reasoning models (LRMs) and compare them against recent SafeChain and STAR-1. UnsafeChain consistently outperforms prior datasets, with even a 1K subset matching or surpassing baseline performance.
arXiv Detail & Related papers (2025-07-29T10:08:52Z) - Circumventing Safety Alignment in Large Language Models Through Embedding Space Toxicity Attenuation [13.971909819796762]
Large Language Models (LLMs) have achieved remarkable success across domains such as healthcare, education, and cybersecurity. Embedding space poisoning is a subtle attack vector where adversaries manipulate the internal semantic representations of input data to bypass safety alignment mechanisms. We propose ETTA, a novel framework that identifies and attenuates toxicity-sensitive dimensions in embedding space via linear transformations.
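As a rough illustration of the linear-transformation idea, the sketch below estimates a toxicity-sensitive direction by a difference of means over a calibration set and attenuates it. The difference-of-means estimator and the single-direction simplification are assumptions, not ETTA's actual procedure.

```python
import numpy as np

def toxicity_direction(safe_emb: np.ndarray, unsafe_emb: np.ndarray) -> np.ndarray:
    """Difference-of-means estimate of a toxicity-sensitive direction.

    safe_emb, unsafe_emb: (num_examples, dim) embeddings of calibration text.
    """
    v = unsafe_emb.mean(axis=0) - safe_emb.mean(axis=0)
    return v / np.linalg.norm(v)

def attenuate(h: np.ndarray, v: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Linearly remove a fraction alpha of each row's component along v.

    h: (num_tokens, dim) hidden states; v: (dim,) unit direction.
    """
    coeff = h @ v  # (num_tokens,) projections onto v
    return h - alpha * np.outer(coeff, v)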
arXiv Detail & Related papers (2025-07-08T03:01:00Z) - Shape it Up! Restoring LLM Safety during Finetuning [66.46166656543761]
Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks. We propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content. We present STAR-DSS, guided by STAR scores, that robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families.
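One way to picture "shaping" is a per-token reweighting of the finetuning loss, as in the minimal sketch below; the plain weight tensor is a placeholder for the paper's fine-grained, STAR-score-derived signals, not the DSS formulation itself.

```python
import torch
import torch.nn.functional as F

def shaped_finetuning_loss(logits: torch.Tensor, labels: torch.Tensor,
                           safety_weights: torch.Tensor) -> torch.Tensor:
    """Cross-entropy reweighted per token by a safety signal.

    logits: (B, T, V); labels: (B, T); safety_weights: (B, T), where
    positive weights reinforce a segment and negative ones suppress it.
    """
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        reduction="none",
    ).reshape(labels.shape)
    return (safety_weights * per_token).mean()
```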
arXiv Detail & Related papers (2025-05-22T18:05:16Z) - Safe Vision-Language Models via Unsafe Weights Manipulation [75.04426753720551]
We revise safety evaluation by introducing Safe-Ground, a new set of metrics that evaluate safety at different levels of granularity. We take a different direction and explore whether it is possible to make a model safer without training, introducing Unsafe Weights Manipulation (UWM). UWM uses a calibration set of safe and unsafe instances to compare activations between safe and unsafe content, identifying the most important parameters for processing the latter.
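A minimal sketch of the activation-comparison idea, assuming a simple mean-magnitude statistic and row-wise dampening of a weight matrix; UWM's actual scoring and manipulation may differ.

```python
import torch

@torch.no_grad()
def unsafe_unit_scores(acts_safe: torch.Tensor,
                       acts_unsafe: torch.Tensor) -> torch.Tensor:
    """Score hidden units by how much more they fire on unsafe inputs.

    acts_*: (num_examples, hidden_dim) activations from the calibration set.
    """
    return acts_unsafe.abs().mean(dim=0) - acts_safe.abs().mean(dim=0)

@torch.no_grad()
def dampen_units(weight: torch.Tensor, scores: torch.Tensor,
                 top_k: int = 32, scale: float = 0.0) -> torch.Tensor:
    """Scale down the weight rows that produce the top-scoring units."""
    idx = scores.topk(top_k).indices
    weight[idx] *= scale
    return weight
```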
arXiv Detail & Related papers (2025-03-14T17:00:22Z) - STAIR: Improving Safety Alignment with Introspective Reasoning [44.780098674618614]
We propose STAIR, a framework that integrates SafeTy Alignment with Introspective Reasoning. We show that STAIR effectively mitigates harmful outputs while better preserving helpfulness, compared to instinctive alignment strategies. With test-time scaling, STAIR achieves a safety performance comparable to Claude-3.5 against popular jailbreak attacks.
arXiv Detail & Related papers (2025-02-04T15:02:55Z) - Vulnerability Mitigation for Safety-Aligned Language Models via Debiasing [12.986006070964772]
Safety alignment is an essential research topic for real-world AI applications. Our study first identified the difficulty of eliminating such vulnerabilities without sacrificing the model's helpfulness. Our method could enhance the model's helpfulness while maintaining safety, thus improving the trade-off front.
arXiv Detail & Related papers (2025-02-04T09:31:54Z) - Backtracking Improves Generation Safety [27.214507875399086]
We propose backtracking, a technique that allows language models to "undo" and recover from their own unsafe generation.
We show that models trained to backtrack are consistently safer than baseline models.
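A minimal sketch of the backtracking idea: generation is wrapped in a loop that discards the partial response whenever the model emits a special reset token. The token name and the `generate_step` callback here are hypothetical; the paper trains the model itself to emit such a token.

```python
RESET = "[RESET]"  # hypothetical name for the special backtracking token

def generate_with_backtracking(generate_step, max_resets: int = 3) -> str:
    """Wrap generation so the model can 'undo' an unsafe partial response.

    generate_step(prefix) is assumed to return the next chunk of text,
    possibly containing RESET when the model flags its own output, and
    an empty string when generation is finished.
    """
    response, resets = "", 0
    while resets <= max_resets:
        chunk = generate_step(response)
        if RESET in chunk:
            response, resets = "", resets + 1  # discard everything so far
            continue
        if not chunk:  # end of generation
            break
        response += chunk
    return response
```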
arXiv Detail & Related papers (2024-09-22T20:28:40Z) - What Makes and Breaks Safety Fine-tuning? A Mechanistic Study [64.9691741899956]
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment.
We design a synthetic data generation framework that captures salient aspects of an unsafe input.
Using this, we investigate three well-known safety fine-tuning methods.
arXiv Detail & Related papers (2024-07-14T16:12:57Z) - Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs).
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position.
DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful response.
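The first component can be pictured as a data-construction step, sketched below under the assumption that the prefix is taken as a raw substring; the actual method operates on tokens, and RTO's reinforcement objective is not shown.

```python
def derta_example(prompt: str, harmful_response: str, refusal: str,
                  prefix_frac: float = 0.3) -> dict:
    """Build one MLE-with-harmful-prefix training pair (toy version).

    A truncated slice of a harmful response is appended to the prompt so
    the model learns to pivot into a refusal mid-response.
    """
    cut = max(1, int(len(harmful_response) * prefix_frac))
    return {"input": prompt + harmful_response[:cut], "target": refusal}
```

For example, `derta_example("How do I ...?", "Step 1: ...", "I can't help with that.")` yields a training input that already contains the start of a harmful answer, paired with a refusal target.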
arXiv Detail & Related papers (2024-07-12T09:36:33Z) - BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models [57.5404308854535]
Safety backdoor attacks in large language models (LLMs) enable the stealthy triggering of unsafe behaviors while evading detection during normal interactions.
We present BEEAR, a mitigation approach leveraging the insight that backdoor triggers induce relatively uniform drifts in the model's embedding space.
Our bi-level optimization method identifies universal embedding perturbations that elicit unwanted behaviors and adjusts the model parameters to reinforce safe behaviors against these perturbations.
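The bi-level structure might look roughly like the sketch below, where `embed_fn`, `head_fn`, and the two loss callables are placeholders and the alternating gradient steps are a simplification of BEEAR's actual optimization.

```python
import torch

def beear_step(embed_fn, head_fn, params, batch, delta,
               unsafe_loss_fn, safe_loss_fn,
               inner_lr: float = 0.1, outer_lr: float = 1e-5):
    """One simplified bi-level step.

    embed_fn(batch): token embeddings; head_fn(emb): logits; delta is a
    universal perturbation added to every token embedding.
    """
    # Inner step: nudge delta toward eliciting the backdoored behavior.
    delta = delta.detach().requires_grad_(True)
    loss_in = unsafe_loss_fn(head_fn(embed_fn(batch) + delta), batch)
    (grad_delta,) = torch.autograd.grad(loss_in, delta)
    delta = (delta - inner_lr * grad_delta).detach()

    # Outer step: update parameters to stay safe under that perturbation.
    loss_out = safe_loss_fn(head_fn(embed_fn(batch) + delta), batch)
    grads = torch.autograd.grad(loss_out, params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p -= outer_lr * g
    return delta
```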
arXiv Detail & Related papers (2024-06-24T19:29:47Z) - Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations [19.132597762214722]
Current alignment methods struggle with dynamic user intentions and complex objectives.
We propose Safety Arithmetic, a training-free framework enhancing safety across different scenarios.
Our experiments show that Safety Arithmetic significantly improves safety measures, reduces over-safety, and maintains model utility.
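Safety Arithmetic steers both parameters and activations; the sketch below shows only a parameter-side operation in the style of task-vector arithmetic, with hypothetical checkpoints `theta_base` (pretrained) and `theta_harm` (harm-finetuned). It is a guess at the flavor of the method, not its exact formulation.

```python
import torch

@torch.no_grad()
def subtract_harm_vector(theta: dict, theta_base: dict, theta_harm: dict,
                         lam: float = 1.0) -> dict:
    """Task-vector-style steering: remove a scaled 'harm direction'.

    theta_harm - theta_base approximates the parameter shift induced by
    harmful finetuning; subtracting it from the deployed weights theta is
    one training-free way to push the model toward safety.
    """
    return {k: theta[k] - lam * (theta_harm[k] - theta_base[k]) for k in theta}
```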
arXiv Detail & Related papers (2024-06-17T17:48:13Z) - Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! [65.06450319194454]
Large language models (LLMs) undergo safety alignment to ensure safe conversations with humans.
This paper introduces a training-free attack method capable of reversing safety alignment.
We name this method emulated disalignment (ED) because sampling from this contrastive distribution provably emulates the result of fine-tuning to minimize a safety reward.
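Contrastive sampling of this kind can be sketched in a few lines; the exact combination ED uses may differ, so treat the form below (amplifying the base model's divergence from its aligned counterpart) as an assumption.

```python
import torch

def emulated_disalignment_logits(base_logits: torch.Tensor,
                                 aligned_logits: torch.Tensor,
                                 alpha: float = 1.0) -> torch.Tensor:
    """Contrastively amplify what alignment suppressed (assumed form).

    Sampling from softmax(result) shifts probability mass toward tokens the
    aligned model down-weights relative to its pretrained counterpart.
    """
    return base_logits + alpha * (base_logits - aligned_logits)
```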
arXiv Detail & Related papers (2024-02-19T18:16:51Z) - Learning Barrier Certificates: Towards Safe Reinforcement Learning with Zero Training-time Violations [64.39401322671803]
This paper explores the possibility of safe RL algorithms with zero training-time safety violations.
We propose an algorithm, Co-trained Barrier Certificate for Safe RL (CRABS), which iteratively learns barrier certificates, dynamics models, and policies.
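CRABS co-trains all three components; the sketch below shows only how a learned barrier certificate h, with certified safe set {s : h(s) >= 0}, can shield a policy at action-selection time. All function names are placeholders.

```python
def safe_action(candidates, state, dynamics, certificate, fallback):
    """Shield a policy with a learned barrier certificate.

    The certified safe set is {s : certificate(s) >= 0}; an action is kept
    only if the learned dynamics model predicts it stays inside that set.
    """
    for action in candidates:
        if certificate(dynamics(state, action)) >= 0.0:
            return action
    return fallback  # known-safe backup action
```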
arXiv Detail & Related papers (2021-08-04T04:59:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.