Universal Adversarial Suffixes Using Calibrated Gumbel-Softmax Relaxation
- URL: http://arxiv.org/abs/2512.08123v1
- Date: Tue, 09 Dec 2025 00:03:39 GMT
- Title: Universal Adversarial Suffixes Using Calibrated Gumbel-Softmax Relaxation
- Authors: Sampriti Soor, Suklav Ghosh, Arijit Sur
- Abstract summary: We study universal adversarial suffixes that, when appended to any input, broadly reduce accuracy across tasks and models. Our approach learns the suffix in a differentiable "soft" form using Gumbel-Softmax relaxation and then discretizes it for inference. A single suffix trained on one model transfers effectively to others, consistently lowering both accuracy and calibrated confidence.
- Score: 9.099589602551573
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Language models (LMs) are often used as zero-shot or few-shot classifiers by scoring label words, but they remain fragile to adversarial prompts. Prior work typically optimizes task- or model-specific triggers, making results difficult to compare and limiting transferability. We study universal adversarial suffixes: short token sequences (4-10 tokens) that, when appended to any input, broadly reduce accuracy across tasks and models. Our approach learns the suffix in a differentiable "soft" form using Gumbel-Softmax relaxation and then discretizes it for inference. Training maximizes calibrated cross-entropy on the label region while masking gold tokens to prevent trivial leakage, with entropy regularization to avoid collapse. A single suffix trained on one model transfers effectively to others, consistently lowering both accuracy and calibrated confidence. Experiments on sentiment analysis, natural language inference, paraphrase detection, commonsense QA, and physical reasoning with Qwen2-1.5B, Phi-1.5, and TinyLlama-1.1B demonstrate consistent attack effectiveness and transfer across tasks and model families.
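As a rough illustration of the training loop described in the abstract, the sketch below learns a soft suffix with Gumbel-Softmax and an entropy regularizer. The vocabulary size, suffix length, temperature, and loss weight are illustrative assumptions rather than values from the paper, and the gold-token masking is indicated only in a comment.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the soft-suffix attack (all hyperparameters assumed).
vocab_size, suffix_len, tau = 32000, 8, 0.5
suffix_logits = torch.randn(suffix_len, vocab_size, requires_grad=True)
opt = torch.optim.Adam([suffix_logits], lr=1e-2)

def soft_suffix_embeddings(embedding_matrix):
    # Differentiable relaxation: sample near-one-hot token mixtures and
    # mix embedding rows instead of committing to discrete tokens.
    soft_onehot = F.gumbel_softmax(suffix_logits, tau=tau, hard=False)  # (L, V)
    return soft_onehot @ embedding_matrix                               # (L, d)

def attack_loss(label_logits, gold_label, entropy_weight=0.01):
    # Untargeted objective: maximize cross-entropy on the label region.
    # Gold answer tokens would be masked out of the suffix vocabulary
    # upstream to prevent trivial leakage; the entropy term keeps the
    # relaxed distribution from collapsing early.
    ce = F.cross_entropy(label_logits, gold_label)
    probs = F.softmax(suffix_logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()
    return -(ce + entropy_weight * entropy)  # minimized by the optimizer

# At inference time, discretize: suffix_ids = suffix_logits.argmax(dim=-1)
```

Gradients flow through the relaxed one-hot mixtures into the frozen model's embedding matrix, so only the suffix parameters are updated.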
Related papers
- CORE: Context-Robust Remasking for Diffusion Language Models [51.59514489363897]
We propose Context-Robust Remasking (CORE), a training-free framework for inference-time revision. Rather than trusting static token probabilities, CORE identifies context-brittle tokens by probing their sensitivity to targeted masked-context perturbations. On LLaDA-8B-Base, CORE delivers consistent improvements across reasoning and code benchmarks, outperforming compute-matched baselines and improving MBPP by up to 9.2 percentage points.
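Read purely from this summary, the brittleness probe might look like the following sketch; the helper names and the L1-drift score are assumptions about a generic masked-LM interface, not CORE's actual procedure.

```python
import torch

def predict_dist(model, token_ids, mask_id, pos):
    # Vocabulary distribution for one re-masked target position.
    ids = token_ids.clone()
    ids[pos] = mask_id
    with torch.no_grad():
        logits = model(ids.unsqueeze(0)).logits[0, pos]
    return logits.softmax(-1)

def brittleness_score(model, token_ids, mask_id, pos, probe_positions):
    # Probe context sensitivity: hide one context token at a time and
    # measure how far the target position's distribution drifts.
    # A large average drift marks the token as context-brittle.
    base = predict_dist(model, token_ids, mask_id, pos)
    drifts = []
    for probe in probe_positions:
        perturbed = token_ids.clone()
        perturbed[probe] = mask_id
        drifts.append((predict_dist(model, perturbed, mask_id, pos) - base).abs().sum())
    return torch.stack(drifts).mean()
```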
arXiv Detail & Related papers (2026-02-04T00:12:30Z)
- Are you going to finish that? A Practical Study of the Partial Token Problem [85.49816027251013]
Language models (LMs) are trained over sequences of tokens, whereas users interact with LMs via text. This mismatch gives rise to the partial token problem, which occurs when a user ends their prompt in the middle of the expected next token. In this work, we identify three domains where token and "word" boundaries often do not line up.
arXiv Detail & Related papers (2026-01-30T17:47:16Z)
- Universal Adversarial Suffixes for Language Models Using Reinforcement Learning with Calibrated Reward [9.099589602551573]
Language models are vulnerable to short adversarial suffixes that can reliably alter predictions. In this paper, a reinforcement learning framework is used in which the suffix is treated as a policy and trained with Proximal Policy Optimization. Results show that RL-trained suffixes consistently degrade accuracy and transfer more effectively across tasks and models than comparable prior adversarial triggers.
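A minimal sketch of "the suffix as a policy" with a PPO-style clipped update; the factorized categorical parameterization, the reward, and the hyperparameters are assumptions, not the paper's implementation.

```python
import torch

# Sketch: one categorical distribution per suffix position, updated with
# a PPO-style clipped objective (all values illustrative).
vocab_size, suffix_len, clip_eps = 32000, 8, 0.2
policy_logits = torch.zeros(suffix_len, vocab_size, requires_grad=True)
opt = torch.optim.Adam([policy_logits], lr=1e-3)

def sample_suffix():
    dist = torch.distributions.Categorical(logits=policy_logits)
    tokens = dist.sample()                       # (L,) discrete suffix
    return tokens, dist.log_prob(tokens).sum()   # joint log-probability

def ppo_update(tokens, old_logp, reward, baseline):
    # reward: e.g., drop in the victim model's calibrated label confidence.
    dist = torch.distributions.Categorical(logits=policy_logits)
    ratio = (dist.log_prob(tokens).sum() - old_logp.detach()).exp()
    adv = reward - baseline
    loss = -torch.min(ratio * adv, ratio.clamp(1 - clip_eps, 1 + clip_eps) * adv)
    opt.zero_grad(); loss.backward(); opt.step()
```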
arXiv Detail & Related papers (2025-12-09T00:18:06Z)
- From Flows to Words: Can Zero-/Few-Shot LLMs Detect Network Intrusions? A Grammar-Constrained, Calibrated Evaluation on UNSW-NB15 [0.41998444721319217]
Large Language Models (LLMs) can reason over natural-language inputs, but their role in intrusion detection without fine-tuning remains uncertain. This study evaluates a prompt-only approach by converting each network flow to a compact textual record and augmenting it with lightweight, domain-inspired flags. We compare zero-shot, instruction-guided, and few-shot prompting to strong neural baselines under identical splits, reporting accuracy, precision, recall, F1, and macro scores.
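The flow-to-text step might look roughly like this; the field names follow the public UNSW-NB15 schema, but the record template, thresholds, and flags are invented for illustration.

```python
def flow_to_text(flow: dict) -> str:
    # Render one UNSW-NB15 flow as a compact textual record plus
    # lightweight, domain-inspired flags (thresholds illustrative).
    flags = []
    if flow["dur"] < 0.001 and flow["sbytes"] > 0:
        flags.append("micro-burst")
    if flow["sbytes"] > 10 * max(flow["dbytes"], 1):
        flags.append("asymmetric-upload")
    return (
        f"proto={flow['proto']} service={flow['service']} "
        f"dur={flow['dur']:.3f}s sbytes={flow['sbytes']} dbytes={flow['dbytes']} "
        f"flags={','.join(flags) or 'none'}"
    )

# Example:
# flow_to_text({"proto": "tcp", "service": "http", "dur": 0.0004,
#               "sbytes": 4096, "dbytes": 60})
```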
arXiv Detail & Related papers (2025-10-18T02:11:50Z)
- You only need 4 extra tokens: Synergistic Test-time Adaptation for LLMs [50.54173262572369]
Large language models (LLMs) are increasingly deployed in specialized domains such as finance, medicine, and agriculture. We study label-free test-time adaptation for language models and present SyTTA, an inference-time framework that adapts models on-the-fly without additional supervision.
arXiv Detail & Related papers (2025-10-11T14:00:39Z)
- Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
Evaluating a constraint on every token can be prohibitively expensive. Locally constrained decoding (LCD) can distort the global distribution over strings because it samples tokens based only on local information. We show that our approach is superior to state-of-the-art baselines.
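The rejection idea can be pictured with this generic sketch: sample from the unconstrained next-token distribution and evaluate the expensive constraint only on tokens actually drawn. The adaptive weighting of the actual algorithm is not reproduced here.

```python
import random

def lazy_constrained_sample(token_probs, allowed, max_tries=8):
    # token_probs: dict token -> unconstrained probability at this step.
    # allowed: expensive constraint check, called only on sampled tokens.
    tokens = list(token_probs)
    weights = [token_probs[t] for t in tokens]
    for _ in range(max_tries):
        tok = random.choices(tokens, weights=weights, k=1)[0]
        if allowed(tok):
            return tok
    # Rare fallback: filter the vocabulary explicitly.
    legal = [t for t in tokens if allowed(t)]
    return random.choices(legal, weights=[token_probs[t] for t in legal], k=1)[0]
```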
arXiv Detail & Related papers (2025-04-07T18:30:18Z)
- Not all tokens are created equal: Perplexity Attention Weighted Networks for AI generated text detection [49.15148871877941]
Next-token distribution outputs offer a theoretically appealing approach for detecting text generated by large language models (LLMs). We propose the Perplexity Attention Weighted Network (PAWN), which uses the LLM's last hidden states and token positions to weight a sum of features derived from next-token distribution metrics across the sequence length. PAWN shows competitive and even better in-distribution performance than the strongest baselines with a fraction of their trainable parameters.
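A rough sketch of that weighting scheme; the layer sizes and the two example features (token log-likelihood and entropy) are assumptions.

```python
import torch
import torch.nn as nn

class PAWNSketch(nn.Module):
    # Sketch: per-position features from the next-token distribution,
    # weighted by a small head over the LM's last hidden states.
    def __init__(self, hidden_dim, n_features=2):
        super().__init__()
        self.weigher = nn.Linear(hidden_dim, 1)
        self.classifier = nn.Linear(n_features, 1)

    def forward(self, hidden, logp_tok, entropy):
        # hidden: (T, d) last hidden states; logp_tok, entropy: (T,)
        w = torch.softmax(self.weigher(hidden).squeeze(-1), dim=0)  # (T,)
        feats = torch.stack([logp_tok, entropy], dim=-1)            # (T, 2)
        pooled = (w[:, None] * feats).sum(0)                        # (2,)
        return self.classifier(pooled)                              # detection score
```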
arXiv Detail & Related papers (2025-01-07T17:00:49Z)
- Autoregressive Speech Synthesis without Vector Quantization [135.4776759536272]
We present MELLE, a novel continuous-valued-token language modeling approach for text-to-speech synthesis (TTS). MELLE autoregressively generates continuous mel-spectrogram frames directly from the text condition. MELLE mitigates robustness issues by avoiding the inherent flaws of sampling vector-quantized codes.
arXiv Detail & Related papers (2024-07-11T14:36:53Z)
- An Analysis and Mitigation of the Reversal Curse [70.13419502543915]
Recent research observed a noteworthy phenomenon in large language models (LLMs): the reversal curse. When dealing with two entities $a$ and $b$, LLMs excel in handling sequences of the form "$aRb$" but encounter challenges when processing "$bR^{-1}a$".
arXiv Detail & Related papers (2023-11-13T17:01:12Z)
- Nonparametric Masked Language Modeling [113.71921977520864]
Existing language models (LMs) predict tokens with a softmax over a finite vocabulary.
We introduce NPM, the first nonparametric masked language model that replaces this softmax with a nonparametric distribution over every phrase in a reference corpus.
NPM can be efficiently trained with a contrastive objective and an in-batch approximation to full corpus retrieval.
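A hedged sketch of the retrieval-in-place-of-softmax idea and the in-batch contrastive objective; the encoders and similarity choice are generic assumptions.

```python
import torch
import torch.nn.functional as F

def npm_style_predict(query_vec, phrase_vecs, phrase_texts, k=1):
    # Instead of a softmax over a fixed vocabulary, score every corpus
    # phrase embedding against the masked-position query and return the
    # nearest phrases. phrase_vecs: (N, d), precomputed offline.
    sims = phrase_vecs @ query_vec  # (N,)
    return [phrase_texts[i] for i in sims.topk(k).indices]

def in_batch_contrastive_loss(queries, positives, temperature=0.07):
    # In-batch approximation to full-corpus retrieval: each example's
    # positive phrase serves as a negative for every other query.
    logits = queries @ positives.T / temperature  # (B, B)
    labels = torch.arange(queries.size(0))
    return F.cross_entropy(logits, labels)
```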
arXiv Detail & Related papers (2022-12-02T18:10:42Z)
- MASKER: Masked Keyword Regularization for Reliable Text Classification [73.90326322794803]
We propose a fine-tuning method, coined masked keyword regularization (MASKER), that facilitates context-based prediction.
MASKER regularizes the model to reconstruct the keywords from the rest of the words and make low-confidence predictions without enough context.
We demonstrate that MASKER improves OOD detection and cross-domain generalization without degrading classification accuracy.
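The two regularizers could be sketched as follows; the model interface (`mlm_logits`, `cls_logits`) and the uniform-target KL term are assumptions about how "low-confidence predictions without enough context" is enforced.

```python
import torch
import torch.nn.functional as F

def masker_style_losses(model, ids, kw_pos, mask_id, n_cls):
    # (1) Reconstruction: mask the keywords, predict them from context.
    ctx_only = ids.clone()
    ctx_only[:, kw_pos] = mask_id
    mlm_logits = model(ctx_only).mlm_logits                 # (B, T, V)
    recon = F.cross_entropy(mlm_logits[:, kw_pos].flatten(0, 1),
                            ids[:, kw_pos].flatten())
    # (2) Low confidence without context: show ONLY the keywords and
    #     push the class distribution toward uniform.
    kw_only = torch.full_like(ids, mask_id)
    kw_only[:, kw_pos] = ids[:, kw_pos]
    cls_logits = model(kw_only).cls_logits                  # (B, n_cls)
    uniform = torch.full_like(cls_logits, 1.0 / n_cls)
    low_conf = F.kl_div(cls_logits.log_softmax(-1), uniform,
                        reduction="batchmean")
    return recon, low_conf
```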
arXiv Detail & Related papers (2020-12-17T04:54:16Z)
- Differentiable Language Model Adversarial Attacks on Categorical Sequence Classifiers [0.0]
An adversarial attack paradigm explores various scenarios for the vulnerability of deep learning models.
We fine-tune a language model to act as a generator of adversarial examples.
Our model works for diverse datasets on bank transactions, electronic health records, and NLP datasets.
arXiv Detail & Related papers (2020-06-19T11:25:36Z)
- Classifier-independent Lower-Bounds for Adversarial Robustness [13.247278149124757]
We theoretically analyse the limits of robustness to test-time adversarial and noisy examples in classification.
We use optimal transport theory to derive variational formulae for the Bayes-optimal error a classifier can make on a given classification problem.
We derive explicit lower-bounds on the Bayes-optimal error in the case of the popular distance-based attacks.
arXiv Detail & Related papers (2020-06-17T16:46:39Z)