Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment
- URL: http://arxiv.org/abs/2402.14016v2
- Date: Thu, 4 Jul 2024 12:34:44 GMT
- Title: Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment
- Authors: Vyas Raina, Adian Liusie, Mark Gales
- Abstract summary: Large Language Models (LLMs) are powerful zero-shot assessors used in real-world situations such as assessing written exams and benchmarking systems.
We show that short universal adversarial phrases can be concatenated to deceive judge-LLMs into predicting inflated scores.
It is found that judge-LLMs are significantly more susceptible to these adversarial attacks when used for absolute scoring.
- Score: 8.948475969696075
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are powerful zero-shot assessors used in real-world situations such as assessing written exams and benchmarking systems. Despite these critical applications, no existing work has analyzed the vulnerability of judge-LLMs to adversarial manipulation. This work presents the first study on the adversarial robustness of assessment LLMs, where we demonstrate that short universal adversarial phrases can be concatenated to deceive judge-LLMs into predicting inflated scores. Since adversaries may not know or have access to the judge-LLMs, we propose a simple surrogate attack where a surrogate model is first attacked, and the learned attack phrase is then transferred to unknown judge-LLMs. We propose a practical algorithm to determine the short universal attack phrases and demonstrate that when transferred to unseen models, scores can be drastically inflated such that, irrespective of the assessed text, maximum scores are predicted. It is found that judge-LLMs are significantly more susceptible to these adversarial attacks when used for absolute scoring, as opposed to comparative assessment. Our findings raise concerns about the reliability of LLM-as-a-judge methods, and emphasize the importance of addressing vulnerabilities in LLM assessment methods before deployment in high-stakes real-world scenarios.
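The abstract describes a greedy, surrogate-based search for a short universal attack phrase that is then transferred to unseen judge-LLMs. Below is a minimal, self-contained sketch of that general idea only; the surrogate_score stub, the toy VOCAB list, and the greedy loop are illustrative assumptions, not the authors' algorithm or implementation.

```python
# Hedged sketch: greedily build a short universal phrase that inflates the
# scores of a surrogate absolute-scoring judge, then (in practice) transfer
# the learned phrase to an unseen judge-LLM. All components are stand-ins.

import random

VOCAB = ["excellent", "outstanding", "perfect", "ten", "flawless", "ignore"]  # toy candidates


def surrogate_score(text: str) -> float:
    """Hypothetical placeholder for querying a surrogate judge-LLM for an
    absolute score; faked here so the sketch runs end to end."""
    return sum(text.count(w) for w in ("excellent", "perfect", "ten")) + random.random()


def greedy_universal_phrase(train_texts, max_words=4):
    """Greedily append the word that most inflates the *mean* surrogate score
    across all training texts (the 'universal' property)."""
    phrase = []
    for _ in range(max_words):
        best_word, best_mean = None, float("-inf")
        for w in VOCAB:
            candidate = " ".join(phrase + [w])
            mean_score = sum(
                surrogate_score(t + " " + candidate) for t in train_texts
            ) / len(train_texts)
            if mean_score > best_mean:
                best_word, best_mean = w, mean_score
        phrase.append(best_word)
    return " ".join(phrase)


if __name__ == "__main__":
    texts = ["A mediocre summary.", "An off-topic answer.", "A short essay."]
    attack = greedy_universal_phrase(texts)
    # The learned phrase would then be concatenated to new texts and scored by
    # an *unseen* judge-LLM to test transferability.
    print("Learned universal phrase:", attack)
```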
Related papers
- FLEX: A Benchmark for Evaluating Robustness of Fairness in Large Language Models [7.221774553388335]
We introduce a new benchmark to test whether Large Language Models can sustain fairness even when exposed to prompts constructed to induce bias.
We integrate prompts that amplify potential biases into the fairness assessment.
This highlights the need for more stringent evaluation benchmarks to guarantee safety and fairness.
arXiv Detail & Related papers (2025-03-25T10:48:33Z) - Know Thy Judge: On the Robustness Meta-Evaluation of LLM Safety Judges [3.168632659778101]
We highlight two critical challenges that are typically overlooked: (i) evaluations in the wild where factors like prompt sensitivity and distribution shifts can affect performance and (ii) adversarial attacks that target the judge.
We show that small changes such as the style of the model output can lead to jumps of up to 0.24 in the false negative rate on the same dataset, whereas adversarial attacks on the model generation can fool some judges into misclassifying 100% of harmful generations as safe ones.
arXiv Detail & Related papers (2025-03-06T14:24:12Z) - LLM-Safety Evaluations Lack Robustness [58.334290876531036]
We argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise.
We propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers.
arXiv Detail & Related papers (2025-03-04T12:55:07Z) - PredictaBoard: Benchmarking LLM Score Predictability [50.47497036981544]
Large Language Models (LLMs) often fail unpredictably.
This poses a significant challenge to ensuring their safe deployment.
We present PredictaBoard, a novel collaborative benchmarking framework.
arXiv Detail & Related papers (2025-02-20T10:52:38Z) - Towards Understanding the Robustness of LLM-based Evaluations under Perturbations [9.944512689015998]
Large Language Models (LLMs) can serve as automatic evaluators for non-standardized metrics in summarization and dialog-based tasks.
We conduct experiments across multiple prompting strategies to examine how LLMs fare as quality evaluators when compared with human judgments.
arXiv Detail & Related papers (2024-12-12T13:31:58Z) - LLM Detectors Still Fall Short of Real World: Case of LLM-Generated Short News-Like Posts [7.680851067579922]
This paper focuses on an important setting in information operations -- short news-like posts generated by moderately sophisticated attackers.
We demonstrate that existing LLM detectors, whether zero-shot or purpose-trained, are not ready for real-world use in that setting.
A purpose-trained detector generalizing across LLMs and unseen attacks can be developed, but it fails to generalize to new human-written texts.
arXiv Detail & Related papers (2024-09-05T06:55:13Z) - Decoding Biases: Automated Methods and LLM Judges for Gender Bias Detection in Language Models [47.545382591646565]
Large Language Models (LLMs) have excelled at language understanding and generating human-level text.
LLMs are susceptible to adversarial attacks where malicious users prompt the model to generate undesirable text.
In this work, we train models to automatically create adversarial prompts to elicit biased responses from target LLMs.
arXiv Detail & Related papers (2024-08-07T17:11:34Z) - Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models [79.76293901420146]
Large Language Models (LLMs) are employed across various high-stakes domains, where the reliability of their outputs is crucial.
Our research investigates the fragility of uncertainty estimation and explores potential attacks.
We demonstrate that an attacker can embed a backdoor in LLMs, which, when activated by a specific trigger in the input, manipulates the model's uncertainty without affecting the final output.
arXiv Detail & Related papers (2024-07-15T23:41:11Z) - SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors [64.9938658716425]
Existing evaluations of large language models' (LLMs) ability to recognize and reject unsafe user requests face three limitations.
First, existing methods often use coarse-grained taxonomies of unsafe topics and over-represent some fine-grained topics.
Second, linguistic characteristics and formatting of prompts, such as different languages and dialects, are often overlooked and only implicitly considered in many evaluations.
Third, existing evaluations rely on large LLMs for evaluation, which can be expensive.
arXiv Detail & Related papers (2024-06-20T17:56:07Z) - Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective [66.34066553400108]
We conduct a rigorous evaluation of Large Language Models' implicit bias towards certain groups by attacking them with carefully crafted instructions to elicit biased responses.
We propose three attack approaches, i.e., Disguise, Deception, and Teaching, based on which we built evaluation datasets for four common bias types.
arXiv Detail & Related papers (2024-06-20T06:42:08Z) - Can Many-Shot In-Context Learning Help LLMs as Evaluators? A Preliminary Empirical Study [14.906150451947443]
We study two many-shot in-context learning (ICL) prompts that help evaluators mitigate potential biases in Large Language Models (LLMs).
Based on the designed prompts, we investigate the impact of scaling the number of in-context examples on the consistency and quality of the evaluation results.
Experimental results show that advanced LLMs, such as GPT-4o, perform better in the many-shot regime than in the zero-shot regime.
arXiv Detail & Related papers (2024-06-17T15:11:58Z) - Don't Say No: Jailbreaking LLM by Suppressing Refusal [13.666830169722576]
In this study, we first uncover why the vanilla target loss is not optimal, then explore and enhance the loss objective and introduce the DSN (Don't Say No) attack.
Existing evaluations, such as refusal keyword matching, produce numerous false positives and false negatives.
To overcome this challenge, we propose an Ensemble Evaluation pipeline that incorporates Natural Language Inference (NLI) contradiction assessment and two external LLM evaluators.
arXiv Detail & Related papers (2024-04-25T07:15:23Z) - Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress.
Our investigation exposes a critical oversight in the assumption that base LLMs, which lack instruction tuning, cannot readily be misused.
By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z) - AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models [29.92550386563915]
We introduce an innovative framework that can help evaluate the effectiveness of jailbreak attacks on large language models.
We present two distinct evaluation frameworks: a coarse-grained evaluation and a fine-grained evaluation.
We develop a comprehensive ground truth dataset specifically tailored for jailbreak prompts.
arXiv Detail & Related papers (2024-01-17T06:42:44Z) - Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models [82.98081731588717]
Integration of large language models with external content exposes applications to indirect prompt injection attacks.
We introduce the first benchmark for indirect prompt injection attacks, named BIPIA, to evaluate the risk of such attacks.
We develop two black-box methods based on prompt learning and a white-box defense method based on fine-tuning with adversarial training.
arXiv Detail & Related papers (2023-12-21T01:08:39Z) - SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs).
Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs.
arXiv Detail & Related papers (2023-10-05T17:01:53Z)
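The SmoothLLM entry above describes a randomized-smoothing style defense: perturb several copies of a prompt at the character level, then aggregate the resulting judgements. The following is a minimal, self-contained sketch of that idea under stated assumptions; query_llm and is_jailbroken are hypothetical stubs, and the aggregation shown is an illustration rather than the paper's exact algorithm.

```python
# Hedged sketch of a SmoothLLM-style defense: character-level perturbations of
# the prompt are brittle for adversarial suffixes, so a majority vote over
# perturbed copies can flag likely adversarial inputs. All calls are stubs.

import random
import string


def perturb(prompt: str, swap_rate: float = 0.1) -> str:
    """Randomly swap a fraction of characters in the prompt."""
    chars = list(prompt)
    for i in range(len(chars)):
        if random.random() < swap_rate:
            chars[i] = random.choice(string.printable)
    return "".join(chars)


def query_llm(prompt: str) -> str:
    """Hypothetical placeholder for a call to the defended LLM."""
    return "I cannot help with that."  # stub response


def is_jailbroken(response: str) -> bool:
    """Crude keyword stand-in for a proper jailbreak judge on the response."""
    return not response.lower().startswith("i cannot")


def smoothllm_defense(prompt: str, n_copies: int = 5) -> str:
    """Majority vote: if most perturbed copies yield jailbroken responses,
    treat the prompt as adversarial; otherwise answer it normally."""
    responses = [query_llm(perturb(prompt)) for _ in range(n_copies)]
    if sum(is_jailbroken(r) for r in responses) > n_copies // 2:
        return "Request declined: flagged as a likely adversarial prompt."
    return query_llm(prompt)


if __name__ == "__main__":
    print(smoothllm_defense("Summarise the attached essay."))
```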