Related papers: Adversarial Preference Learning for Robust LLM Alignment

URL: http://arxiv.org/abs/2505.24369v1
Date: Fri, 30 May 2025 09:02:07 GMT
Title: Adversarial Preference Learning for Robust LLM Alignment
Authors: Yuanfu Wang, Pengyu Wang, Chenyang Xi, Bo Tang, Junyi Zhu, Wenqiang Wei, Chen Chen, Chao Yang, Jingfeng Zhang, Chaochao Lu, Yijun Niu, Keming Mao, Zhiyu Li, Feiyu Xiong, Jie Hu, Mingchuan Yang
Abstract summary: Adversarial Preference Learning (APL) is an iterative adversarial training method incorporating three key innovations. First, a direct harmfulness metric based on the model's intrinsic preference probabilities. Second, a conditional generative attacker that synthesizes input-specific adversarial variations.
Score: 24.217309343426297
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Modern language models often rely on Reinforcement Learning from Human Feedback (RLHF) to encourage safe behaviors. However, they remain vulnerable to adversarial attacks due to three key limitations: (1) the inefficiency and high cost of human annotation, (2) the vast diversity of potential adversarial attacks, and (3) the risk of feedback bias and reward hacking. To address these challenges, we introduce Adversarial Preference Learning (APL), an iterative adversarial training method incorporating three key innovations. First, a direct harmfulness metric based on the model's intrinsic preference probabilities, eliminating reliance on external assessment. Second, a conditional generative attacker that synthesizes input-specific adversarial variations. Third, an iterative framework with automated closed-loop feedback, enabling continuous adaptation through vulnerability discovery and mitigation. Experiments on Mistral-7B-Instruct-v0.3 demonstrate that APL significantly enhances robustness, achieving 83.33% harmlessness win rate over the base model (evaluated by GPT-4o), reducing harmful outputs from 5.88% to 0.43% (measured by LLaMA-Guard), and lowering attack success rate by up to 65% according to HarmBench. Notably, APL maintains competitive utility, with an MT-Bench score of 6.59 (comparable to the baseline 6.78) and an LC-WinRate of 46.52% against the base model.
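The first innovation in the abstract, a harmfulness metric read off the model's own preference probabilities, can be illustrated with a minimal Python sketch. The abstract does not give the formula, so everything here is an assumption made for illustration: the Bradley-Terry style sigmoid over a log-probability gap, the `beta` temperature, and the function names `stable_sigmoid` and `harmfulness_score` are hypothetical and are not taken from the paper.

```python
import math


def stable_sigmoid(x: float) -> float:
    """Logistic function written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def harmfulness_score(logp_harmful: float, logp_safe: float, beta: float = 1.0) -> float:
    """Hypothetical metric: probability that the model 'prefers' the harmful
    response over the safe one, given the sequence log-probabilities the model
    itself assigns to each response for the same prompt.

    Assumption: a Bradley-Terry style preference probability,
    sigmoid(beta * (logp_harmful - logp_safe)). The paper may define its
    metric differently.
    """
    return stable_sigmoid(beta * (logp_harmful - logp_safe))


if __name__ == "__main__":
    # Made-up log-probabilities: the model assigns more mass to the safe reply,
    # so the score falls below 0.5.
    print(harmfulness_score(logp_harmful=-42.0, logp_safe=-30.0))
```

Under these assumptions, a score above 0.5 would flag a prompt on which the model leans toward the harmful response. An attacker could search for prompts that raise the score and the defender could train to lower it, with no external judge in the loop.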

Related papers

In-Context Environments Induce Evaluation-Awareness in Language Models [0.12691047660244334]
Humans often become more self-aware under threat, yet can lose self-awareness when absorbed in a task. We introduce a black-box adversarial optimization framework treating the in-context prompt as an optimizable environment. We show that adversarially optimized prompts pose a substantially greater threat to evaluation reliability than previously understood.
arXiv Detail & Related papers (2026-03-04T08:22:02Z)
What Matters For Safety Alignment? [38.86339753409445]
This paper presents a comprehensive empirical study on the safety alignment capabilities of AI systems. We systematically investigate and compare the influence of six critical intrinsic model characteristics and three external attack techniques. We identify the LRMs GPT-OSS-20B, Qwen3-Next-80B-A3B-Thinking, and GPT-OSS-120B as the top-three safest models.
arXiv Detail & Related papers (2026-01-07T12:31:52Z)
Think Twice, Generate Once: Safeguarding by Progressive Self-Reflection [18.467741067831877]
We introduce Progressive Self-Reflection, a novel inference-time technique that empowers large language models to self-monitor and correct their outputs dynamically. Experimental results demonstrate that applying our proposed method to Llama-3.1-8B-Instruct reduces the attack success rate from 77.5% to 5.9%. Our approach acts as a test-time scaling method, where additional self-reflection rounds enhance safety at the cost of inference overhead.
arXiv Detail & Related papers (2025-09-29T12:54:28Z)
Confidence-Aware Routing for Large Language Model Reliability Enhancement: A Multi-Signal Approach to Pre-Generation Hallucination Mitigation [0.0]
Large Language Models suffer from hallucination, generating plausible yet factually incorrect content. Current mitigation strategies focus on post-generation correction, which is computationally expensive and fails to prevent unreliable content generation. We propose a confidence-aware routing system that proactively assesses model uncertainty before generation and redirects queries based on estimated reliability.
arXiv Detail & Related papers (2025-09-23T18:34:20Z)
LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model [99.71684530652942]
We show that LLaVA-Critic-R1 emerges not only as a top-performing critic but also as a competitive policy model. Applying self-critique at test time yields an average +13.8% improvement on five representative reasoning tasks. Our results reveal that RL training on critic data can produce a unified model excelling at both evaluation and generation.
arXiv Detail & Related papers (2025-08-31T03:08:02Z)
ORFuzz: Fuzzing the "Other Side" of LLM Safety -- Testing Over-Refusal [27.26251627767238]
Large Language Models (LLMs) increasingly exhibit over-refusal - erroneously rejecting benign queries due to overly conservative safety measures. This paper introduces the first evolutionary testing framework, ORFuzz, for the systematic detection and analysis of LLM over-refusals.
arXiv Detail & Related papers (2025-08-15T05:03:26Z)
Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models [15.218318229687242]
Extreme activation outliers in Large Language Models critically degrade quantization performance. We introduce Outlier-Safe Pre-Training (OSP), a practical guideline that proactively prevents outlier formation. Our work demonstrates that outliers are not inherent to LLMs but are consequences of training strategies.
arXiv Detail & Related papers (2025-06-24T15:03:57Z)
Reasoning Models Are More Easily Gaslighted Than You Think [85.84943447589511]
We evaluate three state-of-the-art reasoning models, including OpenAI's o4-mini, Claude-3.7-Sonnet and Gemini-2.5-Flash. Our evaluation reveals significant accuracy drops following gaslighting negation prompts. We introduce GaslightingBench-R, a new diagnostic benchmark designed to evaluate reasoning models' susceptibility to defend their belief.
arXiv Detail & Related papers (2025-06-11T12:52:25Z)
Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback [59.078756231841574]
Critique-GRPO is an online RL framework that integrates both natural language and numerical feedback for effective policy optimization. We show Critique-GRPO consistently outperforms supervised learning and RL-based fine-tuning methods across eight challenging mathematical, STEM, and general reasoning tasks.
arXiv Detail & Related papers (2025-06-03T17:39:02Z)
Neural Antidote: Class-Wise Prompt Tuning for Purifying Backdoors in CLIP [51.04452017089568]
Class-wise Backdoor Prompt Tuning (CBPT) is an efficient and effective defense mechanism that operates on text prompts to indirectly purify CLIP. CBPT significantly mitigates backdoor threats while preserving model utility.
arXiv Detail & Related papers (2025-02-26T16:25:15Z)
Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities [49.09703018511403]
Evaluations of large language model (LLM) risks and capabilities are increasingly being incorporated into AI risk management and governance frameworks. Currently, most risk evaluations are conducted by designing inputs that elicit harmful behaviors from the system. We propose evaluating LLMs with model tampering attacks which allow for modifications to latent activations or weights.
arXiv Detail & Related papers (2025-02-03T18:59:16Z)
Improved Generation of Adversarial Examples Against Safety-aligned LLMs [72.38072942860309]
Adversarial prompts generated using gradient-based methods perform strongly in automatic jailbreak attacks against safety-aligned LLMs. In this paper, we explore a new perspective on this problem, suggesting that it can be alleviated by leveraging innovations inspired by transfer-based attacks. We show that 87% of the query-specific adversarial suffixes generated by the developed combination can induce Llama-2-7B-Chat to produce the output that exactly matches the target string on AdvBench.
arXiv Detail & Related papers (2024-05-28T06:10:12Z)
Efficient Adversarial Training in LLMs with Continuous Attacks [99.5882845458567]
Large language models (LLMs) are vulnerable to adversarial attacks that can bypass their safety guardrails. We propose a fast adversarial training algorithm (C-AdvUL) composed of two losses. C-AdvIPO is an adversarial variant of IPO that does not require utility data for adversarially robust alignment.
arXiv Detail & Related papers (2024-05-24T14:20:09Z)
RAIN: Your Language Models Can Align Themselves without Finetuning [25.703729145091483]
Large language models (LLMs) often demonstrate inconsistencies with human preferences. We show that unaligned LLMs can directly produce responses consistent with human preferences via self-boosting. We introduce a novel inference method, Rewindable Auto-regressive INference (RAIN), that allows pre-trained LLMs to evaluate their own generation.
arXiv Detail & Related papers (2023-09-13T17:59:09Z)
Model-tuning Via Prompts Makes NLP Models Adversarially Robust [97.02353907677703]
We show surprising gains in adversarial robustness enjoyed by Model-tuning Via Prompts (MVP). MVP improves performance against adversarial substitutions by an average of 8% over standard methods. We also conduct ablations to investigate the mechanism underlying these gains.
arXiv Detail & Related papers (2023-03-13T17:41:57Z)
Stable and Efficient Adversarial Training through Local Linearization [0.5076419064097734]
A phenomenon referred to as "catastrophic overfitting" has been observed, which is prevalent in single-step defenses. We propose a novel method, Stable and Efficient Adversarial Training (SEAT), which mitigates catastrophic overfitting. Our single-step method can reach 51% robust accuracy for CIFAR-10 with $\ell_\infty$ perturbations of radius $8/255$ under a strong PGD-50 attack.
arXiv Detail & Related papers (2022-10-11T11:57:37Z)

This list is automatically generated from the titles and abstracts of the papers on this site.