Toward Understanding the Transferability of Adversarial Suffixes in Large Language Models
- URL: http://arxiv.org/abs/2510.22014v1
- Date: Fri, 24 Oct 2025 20:28:49 GMT
- Title: Toward Understanding the Transferability of Adversarial Suffixes in Large Language Models
- Authors: Sarah Ball, Niki Hasrati, Alexander Robey, Avi Schwarzschild, Frauke Kreuter, Zico Kolter, Andrej Risteski
- Abstract summary: Discrete optimization-based jailbreaking attacks aim to generate nonsensical suffixes that, when appended onto input prompts, elicit disallowed content. We find that prompt semantic similarity only weakly correlates with transfer success. These findings lead to a more fine-grained understanding of transferability, which we use in interventional experiments to showcase how our statistical analysis can translate into practical improvements in attack success.
- Score: 70.11800794130394
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Discrete optimization-based jailbreaking attacks on large language models aim to generate short, nonsensical suffixes that, when appended onto input prompts, elicit disallowed content. Notably, these suffixes are often transferable -- succeeding on prompts and models for which they were never optimized. And yet, despite the fact that transferability is surprising and empirically well-established, the field lacks a rigorous analysis of when and why transfer occurs. To fill this gap, we identify three statistical properties that strongly correlate with transfer success across numerous experimental settings: (1) how much a prompt without a suffix activates a model's internal refusal direction, (2) how strongly a suffix induces a push away from this direction, and (3) how large these shifts are in directions orthogonal to refusal. On the other hand, we find that prompt semantic similarity only weakly correlates with transfer success. These findings lead to a more fine-grained understanding of transferability, which we use in interventional experiments to showcase how our statistical analysis can translate into practical improvements in attack success.
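The three statistics lend themselves to a compact implementation. Below is a minimal sketch of how they might be computed, assuming you already have last-token hidden states at a fixed layer for a prompt with and without the suffix appended, plus a vector for the model's refusal direction; the function and variable names are illustrative assumptions, not the paper's code.

```python
import numpy as np

def refusal_stats(h_prompt, h_with_suffix, refusal_dir):
    """Three per-prompt statistics the abstract says correlate with transfer.
    h_prompt, h_with_suffix: last-token hidden states of shape (d,) at a fixed
    layer, for the prompt alone and the prompt with the suffix appended.
    refusal_dir: a (d,) vector for the model's internal refusal direction
    (how it is extracted is outside the scope of this sketch)."""
    r = refusal_dir / np.linalg.norm(refusal_dir)
    # (1) how much the bare prompt activates the refusal direction
    refusal_activation = float(h_prompt @ r)
    # shift in the representation induced by appending the suffix
    delta = h_with_suffix - h_prompt
    # (2) how strongly the suffix pushes away from the refusal direction
    push_away = float(-(delta @ r))
    # (3) how large the shift is in directions orthogonal to refusal
    orthogonal_shift = float(np.linalg.norm(delta - (delta @ r) * r))
    return refusal_activation, push_away, orthogonal_shift
```

Intuitively, (1) measures how strongly the bare prompt triggers refusal, (2) how hard the suffix pushes against that direction, and (3) how much of the suffix's effect is spent off-axis.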
Related papers
- Evolving Prompts for Toxicity Search in Large Language Models [3.2729350470429783]
ToxSearch is an evolutionary framework that tests model safety by evolving prompts in a steady-state loop. We observe practically meaningful but attenuated cross-model transfer, with toxicity roughly halved on most targets. These results suggest that small, controllable perturbations are effective vehicles for systematic red-teaming.
arXiv Detail & Related papers (2025-11-16T07:47:31Z)
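As a rough illustration of the steady-state loop described in the entry above, the sketch below evolves a small population of prompts one offspring at a time, replacing the worst member whenever a mutation scores higher; the mutation operator and toxicity scorer are placeholders we assume for illustration, not ToxSearch's actual components.

```python
import random

def steady_state_search(seed_prompts, mutate, toxicity, steps=200):
    """Steady-state evolutionary loop (illustrative, not ToxSearch itself).
    mutate(prompt) -> perturbed prompt; toxicity(prompt) -> score in [0, 1],
    e.g. from a moderation classifier run on the target model's response."""
    population = [(p, toxicity(p)) for p in seed_prompts]
    for _ in range(steps):
        parent, _ = random.choice(population)   # pick any current member
        child = mutate(parent)                  # small, controllable perturbation
        score = toxicity(child)
        worst = min(range(len(population)), key=lambda i: population[i][1])
        if score > population[worst][1]:        # replace the weakest member
            population[worst] = (child, score)
    return max(population, key=lambda x: x[1])  # strongest prompt found
```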
- Crossing the Threshold: Idiomatic Machine Translation through Retrieval Augmentation and Loss Weighting [66.02718577386426]
We provide a simple characterization of idiomatic translation and related issues.
We conduct a synthetic experiment revealing a tipping point at which transformer-based machine translation models correctly default to idiomatic translations.
To improve translation of natural idioms, we introduce two straightforward yet effective techniques.
arXiv Detail & Related papers (2023-10-10T23:47:25Z)
- Lost In Translation: Generating Adversarial Examples Robust to Round-Trip Translation [66.33340583035374]
We present a comprehensive study on the robustness of current text adversarial attacks to round-trip translation.
We demonstrate that 6 state-of-the-art text-based adversarial attacks do not maintain their efficacy after round-trip translation.
We introduce an intervention-based solution to this problem, by integrating Machine Translation into the process of adversarial example generation.
arXiv Detail & Related papers (2023-07-24T04:29:43Z)
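A round-trip robustness check like the one in the entry above can be scripted in a few lines. The sketch below uses public MarianMT checkpoints for an English-French round trip, which is our assumption for illustration; the paper evaluates its own set of attacks and translation systems.

```python
from transformers import MarianMTModel, MarianTokenizer

def round_trip(texts, there="Helsinki-NLP/opus-mt-en-fr",
               back="Helsinki-NLP/opus-mt-fr-en"):
    """Translate en -> fr -> en. An adversarial example that no longer fools
    the victim model after this round trip is not robust to translation."""
    def step(name, batch):
        tok = MarianTokenizer.from_pretrained(name)
        model = MarianMTModel.from_pretrained(name)
        enc = tok(batch, return_tensors="pt", padding=True, truncation=True)
        out = model.generate(**enc)
        return tok.batch_decode(out, skip_special_tokens=True)
    return step(back, step(there, texts))
```

One would then compare the victim model's predictions on the adversarial texts before and after the round trip to measure how much attack efficacy survives.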
- Why Does Little Robustness Help? A Further Step Towards Understanding Adversarial Transferability [23.369773251447636]
Adversarial examples (AEs) for DNNs have been shown to be transferable. In this paper, we take a further step towards understanding adversarial transferability.
arXiv Detail & Related papers (2023-07-15T19:20:49Z)
- Adversarial Attacks are a Surprisingly Strong Baseline for Poisoning Few-Shot Meta-Learners [28.468089304148453]
We attack amortized meta-learners, which allows us to craft colluding sets of inputs that fool the system's learning algorithm.
We show that in a white box setting, these attacks are very successful and can cause the target model's predictions to become worse than chance.
We explore two hypotheses to explain this: 'overfitting' by the attack, and mismatch between the model on which the attack is generated and that to which the attack is transferred.
arXiv Detail & Related papers (2022-11-23T14:55:44Z)
- GAPX: Generalized Autoregressive Paraphrase-Identification X [24.331570697458954]
A major source of the performance drop in out-of-distribution paraphrase identification comes from biases introduced by negative examples.
We introduce a perplexity-based out-of-distribution metric that, as we show, can effectively and automatically determine how much weight it should be given during inference.
arXiv Detail & Related papers (2022-10-05T01:23:52Z)
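The perplexity signal itself is straightforward to reproduce. Below is a sketch that scores inputs with an off-the-shelf causal LM and squashes the score into an inference-time weight; GPT-2, the threshold, and the sigmoid weighting are our assumptions for illustration, not GAPX's exact recipe.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text):
    ids = tok(text, return_tensors="pt").input_ids
    loss = lm(ids, labels=ids).loss        # mean next-token negative log-likelihood
    return torch.exp(loss).item()

def ood_weight(text, threshold=80.0, scale=20.0):
    # Higher perplexity -> more out-of-distribution -> lower weight at inference.
    return torch.sigmoid(torch.tensor((threshold - perplexity(text)) / scale)).item()
```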
- Transfer Attacks Revisited: A Large-Scale Empirical Study in Real Computer Vision Settings [64.37621685052571]
We conduct the first systematic empirical study of transfer attacks against major cloud-based ML platforms.
The study leads to a number of interesting findings that are inconsistent with existing ones.
We believe this work sheds light on the vulnerabilities of popular ML platforms and points to a few promising research directions.
arXiv Detail & Related papers (2022-04-07T12:16:24Z)
- Avoiding Inference Heuristics in Few-shot Prompt-based Finetuning [57.4036085386653]
We show that prompt-based models for sentence-pair classification tasks still suffer from a common pitfall: adopting inference heuristics based on lexical overlap.
We then show that adding a regularization that preserves pretraining weights is effective in mitigating this destructive tendency of few-shot finetuning.
arXiv Detail & Related papers (2021-09-09T10:10:29Z)
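One standard way to "preserve pretraining weights" during finetuning, sketched below, is an L2 penalty that anchors the parameters to their pretrained values (in the style of L2-SP); whether this matches the paper's exact regularizer is an assumption on our part.

```python
import torch

def make_l2_sp_penalty(model, lam=0.01):
    """Penalize drift from the pretrained weights during few-shot finetuning.
    Call this once, right after loading the pretrained checkpoint."""
    anchors = [p.detach().clone() for p in model.parameters()]  # frozen snapshot
    def penalty():
        return lam * sum(((p - a) ** 2).sum()
                         for p, a in zip(model.parameters(), anchors))
    return penalty

# usage inside the training loop: loss = task_loss + penalty()
```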
- Weakly-Supervised Cross-Domain Adaptation for Endoscopic Lesions Segmentation [79.58311369297635]
We propose a new weakly-supervised lesions transfer framework, which can explore transferable domain-invariant knowledge across different datasets.
A Wasserstein-quantified transferability framework is developed to highlight wide-range transferable contextual dependencies.
A novel self-supervised pseudo label generator is designed to equally provide confident pseudo pixel labels for both hard-to-transfer and easy-to-transfer target samples.
arXiv Detail & Related papers (2020-12-08T02:26:03Z)
- CSCL: Critical Semantic-Consistent Learning for Unsupervised Domain Adaptation [42.226842513334184]
We develop a new Critical Semantic-Consistent Learning model, which mitigates the discrepancy of both domain-wise and category-wise distributions.
Specifically, a critical transfer-based adversarial framework is designed to highlight transferable domain-wise knowledge while neglecting untransferable knowledge.
arXiv Detail & Related papers (2020-08-24T14:12:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.