Related papers: Fast Proxies for LLM Robustness Evaluation

Fast Proxies for LLM Robustness Evaluation

URL: http://arxiv.org/abs/2502.10487v1
Date: Fri, 14 Feb 2025 11:15:27 GMT
Title: Fast Proxies for LLM Robustness Evaluation
Authors: Tim Beyer, Jan Schuchardt, Leo Schwinn, Stephan Günnemann,
Abstract summary: We compare the ability of fast proxy metrics to predict the real-world robustness of an LLM against a simulated attacker ensemble.<n>This allows us to estimate a model's robustness to computationally expensive attacks without requiring runs of the attacks themselves.
Score: 48.53873823665833
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Evaluating the robustness of LLMs to adversarial attacks is crucial for safe deployment, yet current red-teaming methods are often prohibitively expensive. We compare the ability of fast proxy metrics to predict the real-world robustness of an LLM against a simulated attacker ensemble. This allows us to estimate a model's robustness to computationally expensive attacks without requiring runs of the attacks themselves. Specifically, we consider gradient-descent-based embedding-space attacks, prefilling attacks, and direct prompting. Even though direct prompting in particular does not achieve high ASR, we find that it and embedding-space attacks can predict attack success rates well, achieving $r_p=0.87$ (linear) and $r_s=0.94$ (Spearman rank) correlations with the full attack ensemble while reducing computational cost by three orders of magnitude.

Related papers

Poisoning Attacks to Local Differential Privacy for Ranking Estimation [8.14832255549522]
Local differential privacy (LDP) involves users perturbing their inputs to provide plausible deniability of their data.<n>In this paper, we first introduce novel poisoning attacks for ranking estimation.<n>We propose corresponding strategies for kRR, OUE, and OLH protocols.
arXiv Detail & Related papers (2025-06-30T16:39:02Z)
Goal-guided Generative Prompt Injection Attack on Large Language Models [6.175969971471705]
Large language models (LLMs) provide a strong foundation for large-scale user-oriented natural language tasks. A large number of users can easily inject adversarial text or instructions through the user interface. It is unclear how these strategies relate to the success rate of attacks and thus effectively improve model security.
arXiv Detail & Related papers (2024-04-06T06:17:10Z)
Attacking Large Language Models with Projected Gradient Descent [49.19426387912186]
Projected Gradient Descent (PGD) for adversarial prompts is up to one order of magnitude faster than state-of-the-art discrete optimization. Our PGD for LLMs is up to one order of magnitude faster than state-of-the-art discrete optimization to achieve the same devastating attack results.
arXiv Detail & Related papers (2024-02-14T13:13:26Z)
Attack Prompt Generation for Red Teaming and Defending Large Language Models [70.157691818224]
Large language models (LLMs) are susceptible to red teaming attacks, which can induce LLMs to generate harmful content. We propose an integrated approach that combines manual and automatic methods to economically generate high-quality attack prompts.
arXiv Detail & Related papers (2023-10-19T06:15:05Z)
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs) Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs.
arXiv Detail & Related papers (2023-10-05T17:01:53Z)
Practical Evaluation of Adversarial Robustness via Adaptive Auto Attack [96.50202709922698]
A practical evaluation method should be convenient (i.e., parameter-free), efficient (i.e., fewer iterations) and reliable. We propose a parameter-free Adaptive Auto Attack (A$3$) evaluation method which addresses the efficiency and reliability in a test-time-training fashion.
arXiv Detail & Related papers (2022-03-10T04:53:54Z)
Membership Inference Attacks From First Principles [24.10746844866869]
A membership inference attack allows an adversary to query a trained machine learning model to predict whether or not a particular example was contained in the model's training dataset. These attacks are currently evaluated using average-case "accuracy" metrics that fail to characterize whether the attack can confidently identify any members of the training set. We argue that attacks should instead be evaluated by computing their true-positive rate at low false-positive rates, and find most prior attacks perform poorly when evaluated in this way. Our attack is 10x more powerful at low false-positive rates, and also strictly dominates prior attacks on existing metrics.
arXiv Detail & Related papers (2021-12-07T08:47:00Z)
Composite Adversarial Attacks [57.293211764569996]
Adversarial attack is a technique for deceiving Machine Learning (ML) models. In this paper, a new procedure called Composite Adrial Attack (CAA) is proposed for automatically searching the best combination of attack algorithms. CAA beats 10 top attackers on 11 diverse defenses with less elapsed time.
arXiv Detail & Related papers (2020-12-10T03:21:16Z)
RayS: A Ray Searching Method for Hard-label Adversarial Attack [99.72117609513589]
We present the Ray Searching attack (RayS), which greatly improves the hard-label attack effectiveness as well as efficiency. RayS attack can also be used as a sanity check for possible "falsely robust" models.
arXiv Detail & Related papers (2020-06-23T07:01:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.