Know Thy Judge: On the Robustness Meta-Evaluation of LLM Safety Judges
- URL: http://arxiv.org/abs/2503.04474v1
- Date: Thu, 06 Mar 2025 14:24:12 GMT
- Title: Know Thy Judge: On the Robustness Meta-Evaluation of LLM Safety Judges
- Authors: Francisco Eiras, Eliott Zemour, Eric Lin, Vaikkunth Mugunthan
- Abstract summary: We highlight two critical challenges that are typically overlooked: (i) evaluations in the wild where factors like prompt sensitivity and distribution shifts can affect performance and (ii) adversarial attacks that target the judge. We show that small changes such as the style of the model output can lead to jumps of up to 0.24 in the false negative rate on the same dataset, whereas adversarial attacks on the model generation can fool some judges into misclassifying 100% of harmful generations as safe ones.
- Score: 3.168632659778101
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Model (LLM) based judges form the underpinnings of key safety evaluation processes such as offline benchmarking, automated red-teaming, and online guardrailing. This widespread requirement raises the crucial question: can we trust the evaluations of these evaluators? In this paper, we highlight two critical challenges that are typically overlooked: (i) evaluations in the wild where factors like prompt sensitivity and distribution shifts can affect performance and (ii) adversarial attacks that target the judge. We highlight the importance of these through a study of commonly used safety judges, showing that small changes such as the style of the model output can lead to jumps of up to 0.24 in the false negative rate on the same dataset, whereas adversarial attacks on the model generation can fool some judges into misclassifying 100% of harmful generations as safe ones. These findings reveal gaps in commonly used meta-evaluation benchmarks and weaknesses in the robustness of current LLM judges, indicating that low attack success under certain judges could create a false sense of security.
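The style-sensitivity finding suggests a simple meta-evaluation harness: score the same labelled generations before and after a surface-level rewrite and compare false negative rates. The sketch below assumes a hypothetical `judge(prompt, response)` callable and a `restyle` rewriting function; it illustrates the measurement, not the authors' implementation.

```python
# Minimal sketch of a style-sensitivity check for an LLM safety judge.
# Assumptions (not from the paper): judge(prompt, response) returns True when
# the response is flagged as unsafe; restyle(response) rewrites the response
# in a different surface style (e.g. bullet points) without changing content;
# each example is {"prompt": ..., "response": ..., "label": "harmful"/"safe"}.

def false_negative_rate(judge, examples):
    """Fraction of harmful examples that the judge lets through as safe."""
    harmful = [ex for ex in examples if ex["label"] == "harmful"]
    misses = sum(1 for ex in harmful if not judge(ex["prompt"], ex["response"]))
    return misses / max(len(harmful), 1)

def style_sensitivity(judge, examples, restyle):
    """Return (FNR on original outputs, FNR on restyled outputs)."""
    restyled = [dict(ex, response=restyle(ex["response"])) for ex in examples]
    return false_negative_rate(judge, examples), false_negative_rate(judge, restyled)

# A gap on the order of the 0.24 reported above would indicate that the judge's
# verdicts depend heavily on output style rather than on content alone.
```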
Related papers
- FLEX: A Benchmark for Evaluating Robustness of Fairness in Large Language Models [7.221774553388335]
We introduce a new benchmark to test whether Large Language Models can sustain fairness even when exposed to prompts constructed to induce bias.
We integrate prompts that amplify potential biases into the fairness assessment.
Our findings highlight the need for more stringent evaluation benchmarks to guarantee safety and fairness.
arXiv Detail & Related papers (2025-03-25T10:48:33Z)
- LLM-Safety Evaluations Lack Robustness [58.334290876531036]
We argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise. We propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers.
arXiv Detail & Related papers (2025-03-04T12:55:07Z)
- A Framework for Evaluating Vision-Language Model Safety: Building Trust in AI for Public Sector Applications [0.0]
This paper introduces a novel framework to quantify adversarial risks in Vision-Language Models (VLMs).
We analyze model performance under Gaussian, salt-and-pepper, and uniform noise, identifying misclassification thresholds and deriving composite noise patches and saliency patterns that highlight vulnerable regions.
We propose a new Vulnerability Score that combines the impact of random noise and adversarial attacks, providing a comprehensive metric for evaluating model robustness.
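The summary does not give the exact formula for the Vulnerability Score, so the following is only one plausible reading: a weighted combination of the accuracy drops under random noise and under an adversarial attack. The `model`, `noise_fn`, and `attack_fn` interfaces are placeholders, not the paper's API.

```python
import numpy as np

def accuracy(model, images, labels):
    """Placeholder classifier interface: model.predict returns class indices."""
    return float(np.mean(model.predict(images) == labels))

def vulnerability_score(model, images, labels, noise_fn, attack_fn, w_noise=0.5):
    """Illustrative composite score: weighted accuracy drop under random noise
    (e.g. Gaussian or salt-and-pepper) and under an adversarial attack.
    Higher means more vulnerable; an assumption, not the paper's metric."""
    clean = accuracy(model, images, labels)
    noisy = accuracy(model, noise_fn(images), labels)
    adversarial = accuracy(model, attack_fn(model, images, labels), labels)
    return w_noise * (clean - noisy) + (1.0 - w_noise) * (clean - adversarial)
```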
arXiv Detail & Related papers (2025-02-22T21:33:26Z)
- PredictaBoard: Benchmarking LLM Score Predictability [50.47497036981544]
Large Language Models (LLMs) often fail unpredictably. This poses a significant challenge to ensuring their safe deployment. We present PredictaBoard, a novel collaborative benchmarking framework.
arXiv Detail & Related papers (2025-02-20T10:52:38Z)
- JudgeBench: A Benchmark for Evaluating LLM-based Judges [61.048125269475854]
JudgeBench is a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding.
Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks.
arXiv Detail & Related papers (2024-10-16T17:58:19Z)
- Criticality and Safety Margins for Reinforcement Learning [53.10194953873209]
We seek to define a criticality framework with both a quantifiable ground truth and a clear significance to users.
We introduce true criticality as the expected drop in reward when an agent deviates from its policy for n consecutive random actions.
We also introduce the concept of proxy criticality, a low-overhead metric that has a statistically monotonic relationship to true criticality.
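Read literally, true criticality at a state is the expected return lost when the agent takes n random actions before resuming its policy. A Monte Carlo estimate under an assumed Gym-style interface might look like the sketch below; snapshotting the environment with deepcopy is an illustration-only assumption.

```python
import copy

def rollout(env, policy, obs, prefix_actions=(), max_steps=500):
    """Accumulated reward from the current env state: execute any prefix
    actions first, then follow the policy until the episode terminates."""
    total, queue = 0.0, list(prefix_actions)
    for _ in range(max_steps):
        action = queue.pop(0) if queue else policy(obs)
        obs, reward, done, *_ = env.step(action)
        total += reward
        if done:
            break
    return total

def true_criticality(env, obs, policy, n, num_samples=100):
    """Monte Carlo estimate of the expected drop in reward when the agent
    deviates from its policy for n consecutive random actions (paraphrasing
    the definition above). Copying the environment with deepcopy to replay
    from the same state is an assumption; many simulators need an explicit
    save/restore API instead."""
    drops = []
    for _ in range(num_samples):
        on_policy = rollout(copy.deepcopy(env), policy, obs)
        random_prefix = [env.action_space.sample() for _ in range(n)]
        deviated = rollout(copy.deepcopy(env), policy, obs, prefix_actions=random_prefix)
        drops.append(on_policy - deviated)
    return sum(drops) / len(drops)
```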
arXiv Detail & Related papers (2024-09-26T21:00:45Z)
- Jailbreaking as a Reward Misspecification Problem [80.52431374743998]
We propose a novel perspective that attributes this vulnerability to reward misspecification during the alignment process. We introduce a metric, ReGap, to quantify the extent of reward misspecification and demonstrate its effectiveness. We present ReMiss, a system for automated red teaming that generates adversarial prompts in a reward-misspecified space.
arXiv Detail & Related papers (2024-06-20T15:12:27Z)
- Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges [6.609843448260634]
The LLM-as-a-judge paradigm is rapidly gaining traction as an approach to evaluating large language models. This paper focuses on a clean scenario in which inter-human agreement is high. We identify vulnerabilities in judge models, such as their sensitivity to prompt complexity and length, and a tendency toward leniency.
arXiv Detail & Related papers (2024-06-18T13:49:54Z)
- Towards Effective Evaluations and Comparisons for LLM Unlearning Methods [97.2995389188179]
This paper seeks to refine the evaluation of machine unlearning for large language models. It addresses two key challenges: the robustness of evaluation metrics and the trade-offs between competing goals.
arXiv Detail & Related papers (2024-06-13T14:41:00Z)
- Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment [8.948475969696075]
Large Language Models (LLMs) are powerful zero-shot assessors used in real-world situations such as assessing written exams and benchmarking systems.
We show that judge LLMs can be deceived by short universal adversarial phrases into predicting inflated scores.
It is found that judge-LLMs are significantly more susceptible to these adversarial attacks when used for absolute scoring.
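A quick way to probe this failure mode is to compare a judge's absolute scores with and without a short phrase appended to each candidate answer. The sketch below only measures the effect; finding the universal phrase (the attack itself) is out of scope, and `judge_score` is a placeholder for a zero-shot LLM assessor rather than a real API.

```python
def score_inflation(judge_score, answers, suffix):
    """Average score change when a universal adversarial phrase is appended.
    judge_score(answer) -> float is a placeholder for an LLM judge doing
    absolute scoring; `suffix` stands in for a learned universal phrase."""
    deltas = [judge_score(a + " " + suffix) - judge_score(a) for a in answers]
    return sum(deltas) / len(deltas)

# A consistently positive return value means the judge's absolute scores can be
# inflated by text that adds no real quality to the answer.
```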
arXiv Detail & Related papers (2024-02-21T18:55:20Z)
- Familiarity-Based Open-Set Recognition Under Adversarial Attacks [9.934489379453812]
We study gradient-based adversarial attacks on familiarity scores, covering both False Familiarity and False Novelty attacks. We formulate the adversarial reaction score as an alternative OSR scoring rule, which shows a high correlation with the MLS familiarity score.
arXiv Detail & Related papers (2023-11-08T20:17:35Z)
- Fairness Evaluation in Presence of Biased Noisy Labels [84.12514975093826]
We propose a sensitivity analysis framework for assessing how assumptions on the noise across groups affect the predictive bias properties of the risk assessment model.
Our experimental results on two real-world criminal justice datasets demonstrate how even small biases in the observed labels may call into question the conclusions of an analysis based on the noisy outcome.
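As a rough procedural reading of this sensitivity analysis, one can posit group-specific label-noise rates, perturb the observed labels accordingly, and check how far a fairness metric such as the false positive rate gap can move. The data layout and noise model below are assumptions for illustration, not the authors' framework.

```python
import numpy as np

def fpr_gap(y_true, y_pred, group):
    """Difference in false positive rates between group 1 and group 0."""
    def fpr(mask):
        neg = (y_true == 0) & mask
        return ((y_pred == 1) & neg).sum() / max(neg.sum(), 1)
    return fpr(group == 1) - fpr(group == 0)

def sensitivity_to_label_noise(y_obs, y_pred, group, noise_rates, trials=200, seed=0):
    """For assumed group-specific flip rates {group_id: p}, repeatedly flip the
    observed binary labels and report the range of the resulting FPR gap.
    Illustrative only; not the paper's estimator."""
    rng = np.random.default_rng(seed)
    gaps = []
    for _ in range(trials):
        flipped = y_obs.copy()
        for g, p in noise_rates.items():
            idx = np.where(group == g)[0]
            to_flip = idx[rng.random(idx.size) < p]
            flipped[to_flip] = 1 - flipped[to_flip]
        gaps.append(fpr_gap(flipped, y_pred, group))
    return min(gaps), max(gaps)
```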
arXiv Detail & Related papers (2020-03-30T20:47:00Z)