Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment
- URL: http://arxiv.org/abs/2602.21346v1
- Date: Tue, 24 Feb 2026 20:30:51 GMT
- Title: Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment
- Authors: Mengxuan Hu, Vivek V. Datla, Anoop Kumar, Zihan Guan, Sheng Li, Alfy Samuel, Daben Liu
- Abstract summary: Large language models are vulnerable to attacks that disguise harmful intent. This vulnerability stems from shallow alignment mechanisms that lack deep reasoning. We propose enhancing alignment through reasoning-aware post-training.
- Score: 13.463606100715504
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in alignment techniques such as Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), and Direct Preference Optimization (DPO) have improved the safety of large language models (LLMs). However, these LLMs remain vulnerable to jailbreak attacks that disguise harmful intent through indirect or deceptive phrasing. Using causal intervention, we empirically demonstrate that this vulnerability stems from shallow alignment mechanisms that lack deep reasoning, often rejecting harmful prompts without truly understanding why they are harmful. To mitigate this vulnerability, we propose enhancing alignment through reasoning-aware post-training. We construct and release a novel Chain-of-Thought (CoT) fine-tuning dataset that includes both utility-oriented and safety-critical prompts with step-by-step rationales. Fine-tuning on this dataset encourages models to produce principled refusals grounded in reasoning, outperforming standard SFT baselines. Furthermore, inspired by failure patterns in CoT fine-tuning, we introduce Alignment-Weighted DPO, which targets the most problematic parts of an output by assigning different preference weights to the reasoning and final-answer segments. This produces finer-grained, targeted updates than vanilla DPO and improves robustness to diverse jailbreak strategies. Extensive experiments across multiple safety and utility benchmarks show that our method consistently improves alignment robustness while maintaining overall model utility.
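The abstract describes Alignment-Weighted DPO only at a high level: per-segment preference weights are applied to the reasoning and final-answer portions of an output before the usual DPO sigmoid loss. As an illustration, here is a minimal toy sketch of that idea. It assumes per-token log-probabilities are already available as lists and that the reasoning/answer boundary is given as a token count; the function name, the weight values `w_reasoning`/`w_answer`, and the segment-splitting scheme are hypothetical, not the paper's exact formulation.

```python
import math

def segment_weighted_dpo_loss(
    logp_w, logp_ref_w,        # per-token log-probs: chosen, under policy / reference
    logp_l, logp_ref_l,        # per-token log-probs: rejected, under policy / reference
    reasoning_len_w,           # number of reasoning tokens in the chosen output
    reasoning_len_l,           # number of reasoning tokens in the rejected output
    w_reasoning=1.5,           # hypothetical weight on reasoning-segment tokens
    w_answer=1.0,              # hypothetical weight on final-answer tokens
    beta=0.1,                  # standard DPO temperature
):
    """Toy sketch: weight each token's log-ratio by its segment, then apply
    the standard DPO loss -log sigmoid(beta * (chosen - rejected))."""
    def weighted_log_ratio(logp, logp_ref, n_reason):
        total = 0.0
        for i, (lp, lpr) in enumerate(zip(logp, logp_ref)):
            w = w_reasoning if i < n_reason else w_answer
            total += w * (lp - lpr)
        return total

    margin = beta * (
        weighted_log_ratio(logp_w, logp_ref_w, reasoning_len_w)
        - weighted_log_ratio(logp_l, logp_ref_l, reasoning_len_l)
    )
    return math.log(1.0 + math.exp(-margin))  # -log sigmoid(margin)
```

With `w_reasoning == w_answer` this reduces to vanilla sequence-level DPO; raising `w_reasoning` pushes the update harder on the reasoning segment, which matches the abstract's motivation of targeting the most problematic part of the output.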
Related papers
- Detoxifying LLMs via Representation Erasure-Based Preference Optimization [44.29978832356216]
Large language models (LLMs) trained on web-scale data can produce toxic outputs. Prior defenses based on DPO, NPO, and similar algorithms reduce the likelihood of harmful continuations. We propose Representation Erasure-based Preference Optimization (REPO), reformulating detoxification as a token-level preference problem.
arXiv Detail & Related papers (2026-02-24T22:51:06Z) - THINKSAFE: Self-Generated Safety Alignment for Reasoning Models [60.10077024249373]
We propose ThinkSafe, a framework that restores safety alignment without external teachers. Our key insight is that while compliance suppresses safety mechanisms, models often retain latent knowledge to identify harm. Experiments on DeepSeek-R1-Distill and Qwen3 show ThinkSafe significantly improves safety while preserving reasoning proficiency.
arXiv Detail & Related papers (2026-01-30T16:31:02Z) - Self-Guided Defense: Adaptive Safety Alignment for Reasoning Models via Synthesized Guidelines [31.031589383127677]
This paper introduces the Synthesized Guideline-based Adaptive Safety Alignment (SGASA) framework. It internalizes model-generated safety guidelines to strengthen robustness against adversarial prompts. Experiments across multiple datasets demonstrate that SGASA significantly improves model safety, validating its adaptive and scalable effectiveness.
arXiv Detail & Related papers (2025-11-26T09:44:32Z) - Large Reasoning Models Learn Better Alignment from Flawed Thinking [56.08883934423522]
Large reasoning models (LRMs) "think" by generating a structured chain-of-thought (CoT) before producing a final answer. We propose RECAP, a principled reinforcement learning (RL) method for post-training that explicitly teaches models to override flawed reasoning trajectories.
arXiv Detail & Related papers (2025-10-01T14:15:43Z) - AdvChain: Adversarial Chain-of-Thought Tuning for Robust Safety Alignment of Large Reasoning Models [62.70575022567081]
We propose AdvChain, an alignment paradigm that teaches models dynamic self-correction through adversarial CoT tuning. Our work establishes a new direction for building more robust and reliable reasoning models.
arXiv Detail & Related papers (2025-09-29T04:27:23Z) - bi-GRPO: Bidirectional Optimization for Jailbreak Backdoor Injection on LLMs [33.470999703070866]
Existing approaches to embedding jailbreak triggers suffer from limitations including poor generalization, compromised stealthiness, or reduced contextual usability. We propose bi-GRPO, a novel RL-based framework tailored explicitly for jailbreak backdoor injection.
arXiv Detail & Related papers (2025-09-24T05:56:41Z) - SAFER: Advancing Safety Alignment via Efficient Ex-Ante Reasoning [51.78514648677898]
We propose SAFER, a framework for Safety Alignment via eFficient Ex-Ante Reasoning. Our approach instantiates structured ex-ante reasoning through initial assessment, rule verification, and path calibration. Experiments on multiple open-source LLMs demonstrate that SAFER significantly enhances safety performance while maintaining helpfulness and response efficiency.
arXiv Detail & Related papers (2025-04-03T16:07:38Z) - Improving LLM Safety Alignment with Dual-Objective Optimization [81.98466438000086]
Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks. We propose an improved safety alignment method that disentangles the DPO objective into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge.
arXiv Detail & Related papers (2025-03-05T18:01:05Z) - Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking [54.10710423370126]
We propose Reasoning-to-Defend (R2D), a training paradigm that integrates a safety-aware reasoning mechanism into Large Language Models' generation process. CPO enhances the model's perception of the safety status of given dialogues. Experiments demonstrate that R2D effectively mitigates various attacks and improves overall safety, while maintaining the original performance.
arXiv Detail & Related papers (2025-02-18T15:48:46Z) - Deliberative Alignment: Reasoning Enables Safer Language Models [64.60765108418062]
We introduce Deliberative Alignment, a new paradigm that teaches the model safety specifications and trains it to explicitly recall and accurately reason over those specifications before answering. We used this approach to align OpenAI's o-series models, achieving highly precise adherence to OpenAI's safety policies without requiring human-written chains of thought or answers.
arXiv Detail & Related papers (2024-12-20T21:00:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.