Universal Adversarial Suffixes for Language Models Using Reinforcement Learning with Calibrated Reward
- URL: http://arxiv.org/abs/2512.08131v1
- Date: Tue, 09 Dec 2025 00:18:06 GMT
- Title: Universal Adversarial Suffixes for Language Models Using Reinforcement Learning with Calibrated Reward
- Authors: Sampriti Soor, Suklav Ghosh, Arijit Sur
- Abstract summary: Language models are vulnerable to short adversarial suffixes that can reliably alter predictions. This paper uses a reinforcement learning framework in which the suffix is treated as a policy and trained with Proximal Policy Optimization. Results show that RL-trained suffixes consistently degrade accuracy and transfer more effectively across tasks and models than previous adversarial triggers of a similar kind.
- Score: 9.099589602551573
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Language models are vulnerable to short adversarial suffixes that can reliably alter predictions. Previous works usually find such suffixes with gradient search or rule-based methods, but these are brittle and often tied to a single task or model. This paper instead uses a reinforcement learning framework in which the suffix is treated as a policy and trained with Proximal Policy Optimization (PPO) against a frozen model serving as a reward oracle. Rewards are shaped using calibrated cross-entropy, which removes label bias and aggregates probability mass across answer surface forms to improve transferability. The proposed method is evaluated on five diverse NLP benchmark datasets, covering sentiment, natural language inference, paraphrase, and commonsense reasoning, using three distinct language models: Qwen2-1.5B Instruct, TinyLlama-1.1B Chat, and Phi-1.5. Results show that RL-trained suffixes consistently degrade accuracy and transfer more effectively across tasks and models than previous adversarial triggers of a similar kind.
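The reward construction is concrete enough to sketch. The following PyTorch snippet is a minimal illustration, not the authors' implementation: it assumes that each label is scored by aggregating probability mass over the first tokens of its surface forms, and that label bias is estimated from a content-free input (a common calibration recipe); the PPO update itself is standard and omitted.

```python
# Minimal sketch of a calibrated cross-entropy attack reward.
# All helper names and the first-token scoring are assumptions,
# not taken from the paper.
import torch
import torch.nn.functional as F

def surface_form_logprob(logits: torch.Tensor, token_ids: list) -> torch.Tensor:
    """Aggregate log-probability mass over a label's surface forms.

    logits: (vocab,) next-token logits from the frozen reward oracle.
    token_ids: first-token ids of each surface form (e.g. ' yes', 'Yes', ' YES').
    """
    logprobs = F.log_softmax(logits, dim=-1)
    return torch.logsumexp(logprobs[token_ids], dim=0)

def calibrated_reward(task_logits, null_logits, label_forms, gold_label):
    """Calibrated cross-entropy of the gold label, used as the attack reward.

    task_logits: oracle logits for (input + adversarial suffix).
    null_logits: oracle logits for a content-free input (label-bias estimate).
    label_forms: {label: [first-token ids of its surface forms]}.
    """
    labels = sorted(label_forms)
    scores = []
    for label in labels:
        # Subtracting the content-free score removes the oracle's label bias.
        scores.append(surface_form_logprob(task_logits, label_forms[label])
                      - surface_form_logprob(null_logits, label_forms[label]))
    class_logits = torch.stack(scores).unsqueeze(0)  # (1, num_labels)
    gold = torch.tensor([labels.index(gold_label)])
    # Higher cross-entropy means the suffix pushed the oracle further from
    # the gold label, so the CE value itself serves as a positive reward.
    return F.cross_entropy(class_logits, gold)

# Tiny demo with random logits standing in for the frozen oracle.
vocab = 32_000
task_logits, null_logits = torch.randn(vocab), torch.randn(vocab)
forms = {"negative": [1200, 1201], "positive": [900, 901, 902]}
print(calibrated_reward(task_logits, null_logits, forms, "positive"))
```

Under these assumptions, PPO would treat the suffix tokens as actions and maximize this reward in expectation over training inputs, degrading the oracle's accuracy regardless of which surface form of the answer it prefers.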
Related papers
- Universal Adversarial Suffixes Using Calibrated Gumbel-Softmax Relaxation [9.099589602551573]
We study universal adversarial suffixes that, when appended to any input, broadly reduce accuracy across tasks and models. Our approach learns the suffix in a differentiable "soft" form using Gumbel-Softmax relaxation and then discretizes it for inference (a toy sketch of this relaxation appears after this list). A single suffix trained on one model transfers effectively to others, consistently lowering both accuracy and calibrated confidence.
arXiv Detail & Related papers (2025-12-09T00:03:39Z)
- DBR: Divergence-Based Regularization for Debiasing Natural Language Understanding Models [50.54264918467997]
Pre-trained language models (PLMs) have achieved impressive results on various natural language processing tasks. Recent research has revealed that these models often rely on superficial features and shortcuts instead of developing a genuine understanding of language. We propose Divergence-Based Regularization (DBR) to mitigate this shortcut learning behavior.
arXiv Detail & Related papers (2025-02-25T16:44:10Z)
- Mitigating Reversal Curse in Large Language Models via Semantic-aware Permutation Training [57.771940716189114]
We show that large language models (LLMs) suffer from the "reversal curse".
The root cause of the reversal curse lies in the different word order between the training and inference stages.
We propose Semantic-aware Permutation Training (SPT) to address this issue.
arXiv Detail & Related papers (2024-03-01T18:55:20Z)
- Fine-tuning Language Models for Factuality [96.5203774943198]
Large pre-trained language models (LLMs) are now widely used, sometimes even as a replacement for traditional search engines.
Yet language models are prone to making convincing but factually inaccurate claims, often referred to as 'hallucinations'.
In this work, we fine-tune language models to be more factual, without human labeling.
arXiv Detail & Related papers (2023-11-14T18:59:15Z)
- Improving Temporal Generalization of Pre-trained Language Models with Lexical Semantic Change [28.106524698188675]
Recent research has revealed that neural language models at scale suffer from poor temporal generalization capability.
We propose a simple yet effective lexical-level masking strategy to post-train a converged language model.
arXiv Detail & Related papers (2022-10-31T08:12:41Z)
- Quark: Controllable Text Generation with Reinforced Unlearning [68.07749519374089]
Large-scale language models often learn behaviors that are misaligned with user expectations.
We introduce Quantized Reward Konditioning (Quark), an algorithm for optimizing a reward function that quantifies an (un)wanted property.
For unlearning toxicity, negative sentiment, and repetition, our experiments show that Quark outperforms both strong baselines and state-of-the-art reinforcement learning methods.
arXiv Detail & Related papers (2022-05-26T21:11:51Z)
- Adversarial Training for Improving Model Robustness? Look at Both Prediction and Interpretation [21.594361495948316]
We propose a novel feature-level adversarial training method named FLAT.
FLAT incorporates variational word masks in neural networks to learn global word importance.
Experiments show the effectiveness of FLAT in improving the robustness with respect to both predictions and interpretations.
arXiv Detail & Related papers (2022-03-23T20:04:14Z)
- How Should Pre-Trained Language Models Be Fine-Tuned Towards Adversarial Robustness? [121.57551065856164]
We propose Robust Informative Fine-Tuning (RIFT) as a novel adversarial fine-tuning method from an information-theoretical perspective.
RIFT encourages an objective model to retain the features learned from the pre-trained model throughout the entire fine-tuning process.
Experimental results show that RIFT consistently outperforms the state of the art on two popular NLP tasks.
arXiv Detail & Related papers (2021-12-22T05:04:41Z)
- $k$-Neighbor Based Curriculum Sampling for Sequence Prediction [22.631763991832862]
Multi-step-ahead prediction in language models is challenging due to the discrepancy between training-time and test-time processes.
We propose Nearest-Neighbor Replacement Sampling, a curriculum learning-based method that gradually changes an initially deterministic teacher policy.
We report our findings on two language modelling benchmarks and find that the proposed method further improves performance when used in conjunction with scheduled sampling.
arXiv Detail & Related papers (2021-01-22T20:07:29Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed afterwards, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
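As a contrast with the PPO-based approach above, the toy sketch referenced from the Gumbel-Softmax entry in the list follows. It shows the general shape of learning a suffix in differentiable "soft" form and discretizing it for inference; the embedding matrix and the loss are stand-ins, since the paper's actual objective feeds the soft embeddings through a frozen model.

```python
# Toy sketch of Gumbel-Softmax suffix relaxation (not the authors' code).
import torch
import torch.nn.functional as F

vocab_size, embed_dim, suffix_len = 1000, 64, 8
embedding = torch.randn(vocab_size, embed_dim)  # frozen model's input embeddings
suffix_logits = torch.zeros(suffix_len, vocab_size, requires_grad=True)
opt = torch.optim.Adam([suffix_logits], lr=0.1)

for step in range(100):
    # Relaxed one-hot samples: differentiable w.r.t. suffix_logits.
    soft_onehot = F.gumbel_softmax(suffix_logits, tau=1.0, hard=False)
    soft_embeds = soft_onehot @ embedding  # (suffix_len, embed_dim)
    # Stand-in objective; the real attack loss would come from running
    # the frozen model on the input embeddings concatenated with soft_embeds.
    loss = -soft_embeds.pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Discretize for inference: keep the most likely token at each position.
suffix_token_ids = suffix_logits.argmax(dim=-1)
print(suffix_token_ids)
```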