Know Thy Judge: On the Robustness Meta-Evaluation of LLM Safety Judges
- URL: http://arxiv.org/abs/2503.04474v1
- Date: Thu, 06 Mar 2025 14:24:12 GMT
- Title: Know Thy Judge: On the Robustness Meta-Evaluation of LLM Safety Judges
- Authors: Francisco Eiras, Eliott Zemour, Eric Lin, Vaikkunth Mugunthan
- Abstract summary: We highlight two critical challenges that are typically overlooked: (i) evaluations in the wild where factors like prompt sensitivity and distribution shifts can affect performance and (ii) adversarial attacks that target the judge. We show that small changes such as the style of the model output can lead to jumps of up to 0.24 in the false negative rate on the same dataset, whereas adversarial attacks on the model generation can fool some judges into misclassifying 100% of harmful generations as safe ones.
- Score: 3.168632659778101
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Model (LLM) based judges form the underpinnings of key safety evaluation processes such as offline benchmarking, automated red-teaming, and online guardrailing. This widespread requirement raises the crucial question: can we trust the evaluations of these evaluators? In this paper, we highlight two critical challenges that are typically overlooked: (i) evaluations in the wild where factors like prompt sensitivity and distribution shifts can affect performance and (ii) adversarial attacks that target the judge. We highlight the importance of these through a study of commonly used safety judges, showing that small changes such as the style of the model output can lead to jumps of up to 0.24 in the false negative rate on the same dataset, whereas adversarial attacks on the model generation can fool some judges into misclassifying 100% of harmful generations as safe ones. These findings reveal gaps in commonly used meta-evaluation benchmarks and weaknesses in the robustness of current LLM judges, indicating that low attack success under certain judges could create a false sense of security.
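The style-sensitivity finding suggests a simple meta-evaluation harness: score the same labelled generations before and after a surface-level rewrite and compare false negative rates. The sketch below assumes a hypothetical `judge(prompt, response)` callable and a `restyle` rewriting function; it illustrates the measurement, not the authors' implementation.

```python
# Minimal sketch of a style-sensitivity check for an LLM safety judge.
# Assumptions (not from the paper): judge(prompt, response) returns True when
# the response is flagged as unsafe; restyle(response) rewrites the response
# in a different surface style (e.g. bullet points) without changing content;
# each example is {"prompt": ..., "response": ..., "label": "harmful"/"safe"}.

def false_negative_rate(judge, examples):
    """Fraction of harmful examples that the judge lets through as safe."""
    harmful = [ex for ex in examples if ex["label"] == "harmful"]
    misses = sum(1 for ex in harmful if not judge(ex["prompt"], ex["response"]))
    return misses / max(len(harmful), 1)

def style_sensitivity(judge, examples, restyle):
    """Return (FNR on original outputs, FNR on restyled outputs)."""
    restyled = [dict(ex, response=restyle(ex["response"])) for ex in examples]
    return false_negative_rate(judge, examples), false_negative_rate(judge, restyled)

# A gap on the order of the 0.24 reported above would indicate that the judge's
# verdicts depend heavily on output style rather than on content alone.
```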
Related papers
- FLEX: A Benchmark for Evaluating Robustness of Fairness in Large Language Models [7.221774553388335]
We introduce a new benchmark to test whether Large Language Models can sustain fairness even when exposed to prompts constructed to induce bias.
We integrate prompts that amplify potential biases into the fairness assessment.
Our findings highlight the need for more stringent evaluation benchmarks to guarantee safety and fairness.
arXiv Detail & Related papers (2025-03-25T10:48:33Z)
- LLM-Safety Evaluations Lack Robustness [58.334290876531036]
We argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise. We propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers.
arXiv Detail & Related papers (2025-03-04T12:55:07Z)
- A Framework for Evaluating Vision-Language Model Safety: Building Trust in AI for Public Sector Applications [0.0]
This paper introduces a novel framework to quantify adversarial risks in Vision-Language Models (VLMs).
We analyze model performance under Gaussian, salt-and-pepper, and uniform noise, identifying misclassification thresholds and deriving composite noise patches and saliency patterns that highlight vulnerable regions.
We propose a new Vulnerability Score that combines the impact of random noise and adversarial attacks, providing a comprehensive metric for evaluating model robustness.
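The summary does not give the exact formula for the Vulnerability Score, so the following is only one plausible reading: a weighted combination of the accuracy drops under random noise and under an adversarial attack. The `model`, `noise_fn`, and `attack_fn` interfaces are placeholders, not the paper's API.

```python
import numpy as np

def accuracy(model, images, labels):
    """Placeholder classifier interface: model.predict returns class indices."""
    return float(np.mean(model.predict(images) == labels))

def vulnerability_score(model, images, labels, noise_fn, attack_fn, w_noise=0.5):
    """Illustrative composite score: weighted accuracy drop under random noise
    (e.g. Gaussian or salt-and-pepper) and under an adversarial attack.
    Higher means more vulnerable; an assumption, not the paper's metric."""
    clean = accuracy(model, images, labels)
    noisy = accuracy(model, noise_fn(images), labels)
    adversarial = accuracy(model, attack_fn(model, images, labels), labels)
    return w_noise * (clean - noisy) + (1.0 - w_noise) * (clean - adversarial)
```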
arXiv Detail & Related papers (2025-02-22T21:33:26Z)
- PredictaBoard: Benchmarking LLM Score Predictability [50.47497036981544]
Large Language Models (LLMs) often fail unpredictably. This poses a significant challenge to ensuring their safe deployment. We present PredictaBoard, a novel collaborative benchmarking framework.
arXiv Detail & Related papers (2025-02-20T10:52:38Z)
- JudgeBench: A Benchmark for Evaluating LLM-based Judges [61.048125269475854]
JudgeBench is a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding.
Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks.
arXiv Detail & Related papers (2024-10-16T17:58:19Z)
- Criticality and Safety Margins for Reinforcement Learning [53.10194953873209]
We seek to define a criticality framework with both a quantifiable ground truth and a clear significance to users.
We introduce true criticality as the expected drop in reward when an agent deviates from its policy for n consecutive random actions.
We also introduce the concept of proxy criticality, a low-overhead metric that has a statistically monotonic relationship to true criticality.
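Read literally, true criticality at a state is the expected return lost when the agent takes n random actions before resuming its policy. A Monte Carlo estimate under an assumed Gym-style interface might look like the sketch below; snapshotting the environment with deepcopy is an illustration-only assumption.

```python
import copy

def rollout(env, policy, obs, prefix_actions=(), max_steps=500):
    """Accumulated reward from the current env state: execute any prefix
    actions first, then follow the policy until the episode terminates."""
    total, queue = 0.0, list(prefix_actions)
    for _ in range(max_steps):
        action = queue.pop(0) if queue else policy(obs)
        obs, reward, done, *_ = env.step(action)
        total += reward
        if done:
            break
    return total

def true_criticality(env, obs, policy, n, num_samples=100):
    """Monte Carlo estimate of the expected drop in reward when the agent
    deviates from its policy for n consecutive random actions (paraphrasing
    the definition above). Copying the environment with deepcopy to replay
    from the same state is an assumption; many simulators need an explicit
    save/restore API instead."""
    drops = []
    for _ in range(num_samples):
        on_policy = rollout(copy.deepcopy(env), policy, obs)
        random_prefix = [env.action_space.sample() for _ in range(n)]
        deviated = rollout(copy.deepcopy(env), policy, obs, prefix_actions=random_prefix)
        drops.append(on_policy - deviated)
    return sum(drops) / len(drops)
```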
arXiv Detail & Related papers (2024-09-26T21:00:45Z)
- Jailbreaking as a Reward Misspecification Problem [80.52431374743998]
We propose a novel perspective that attributes this vulnerability to reward misspecification during the alignment process. We introduce a metric, ReGap, to quantify the extent of reward misspecification and demonstrate its effectiveness. We present ReMiss, a system for automated red teaming that generates adversarial prompts in a reward-misspecified space.
arXiv Detail & Related papers (2024-06-20T15:12:27Z)
- Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges [6.609843448260634]
The LLM-as-a-judge paradigm is rapidly gaining traction as an approach to evaluating large language models. This paper focuses on a clean scenario in which inter-human agreement is high. We identify vulnerabilities in judge models, such as their sensitivity to prompt complexity and length, and a tendency toward leniency.
arXiv Detail & Related papers (2024-06-18T13:49:54Z)
- Towards Effective Evaluations and Comparisons for LLM Unlearning Methods [97.2995389188179]
This paper seeks to refine the evaluation of machine unlearning for large language models. It addresses two key challenges: the robustness of evaluation metrics and the trade-offs between competing goals.
arXiv Detail & Related papers (2024-06-13T14:41:00Z)
- Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment [8.948475969696075]
Large Language Models (LLMs) are powerful zero-shot assessors used in real-world situations such as assessing written exams and benchmarking systems.
We show that judge LLMs can be deceived by short universal adversarial phrases into predicting inflated scores.
It is found that judge-LLMs are significantly more susceptible to these adversarial attacks when used for absolute scoring.
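A quick way to probe this failure mode is to compare a judge's absolute scores with and without a short phrase appended to each candidate answer. The sketch below only measures the effect; finding the universal phrase (the attack itself) is out of scope, and `judge_score` is a placeholder for a zero-shot LLM assessor rather than a real API.

```python
def score_inflation(judge_score, answers, suffix):
    """Average score change when a universal adversarial phrase is appended.
    judge_score(answer) -> float is a placeholder for an LLM judge doing
    absolute scoring; `suffix` stands in for a learned universal phrase."""
    deltas = [judge_score(a + " " + suffix) - judge_score(a) for a in answers]
    return sum(deltas) / len(deltas)

# A consistently positive return value means the judge's absolute scores can be
# inflated by text that adds no real quality to the answer.
```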
arXiv Detail & Related papers (2024-02-21T18:55:20Z)
- Familiarity-Based Open-Set Recognition Under Adversarial Attacks [9.934489379453812]
We study gradient-based adversarial attacks on familiarity scores, covering both False Familiarity and False Novelty attacks. We formulate the adversarial reaction score as an alternative OSR scoring rule, which shows a high correlation with the MLS familiarity score.
arXiv Detail & Related papers (2023-11-08T20:17:35Z)
- Fairness Evaluation in Presence of Biased Noisy Labels [84.12514975093826]
We propose a sensitivity analysis framework for assessing how assumptions on the noise across groups affect the predictive bias properties of the risk assessment model.
Our experimental results on two real-world criminal justice datasets demonstrate how even small biases in the observed labels may call into question the conclusions of an analysis based on the noisy outcome.
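As a rough procedural reading of this sensitivity analysis, one can posit group-specific label-noise rates, perturb the observed labels accordingly, and check how far a fairness metric such as the false positive rate gap can move. The data layout and noise model below are assumptions for illustration, not the authors' framework.

```python
import numpy as np

def fpr_gap(y_true, y_pred, group):
    """Difference in false positive rates between group 1 and group 0."""
    def fpr(mask):
        neg = (y_true == 0) & mask
        return ((y_pred == 1) & neg).sum() / max(neg.sum(), 1)
    return fpr(group == 1) - fpr(group == 0)

def sensitivity_to_label_noise(y_obs, y_pred, group, noise_rates, trials=200, seed=0):
    """For assumed group-specific flip rates {group_id: p}, repeatedly flip the
    observed binary labels and report the range of the resulting FPR gap.
    Illustrative only; not the paper's estimator."""
    rng = np.random.default_rng(seed)
    gaps = []
    for _ in range(trials):
        flipped = y_obs.copy()
        for g, p in noise_rates.items():
            idx = np.where(group == g)[0]
            to_flip = idx[rng.random(idx.size) < p]
            flipped[to_flip] = 1 - flipped[to_flip]
        gaps.append(fpr_gap(flipped, y_pred, group))
    return min(gaps), max(gaps)
```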
arXiv Detail & Related papers (2020-03-30T20:47:00Z)