Quantitative LLM Judges
- URL: http://arxiv.org/abs/2506.02945v2
- Date: Thu, 23 Oct 2025 03:21:35 GMT
- Title: Quantitative LLM Judges
- Authors: Aishwarya Sahoo, Jeevana Kruthi Karnuthala, Tushar Parmanand Budhwani, Pranchal Agarwal, Sankaran Vaidyanathan, Alexa Siu, Franck Dernoncourt, Jennifer Healey, Nedim Lipka, Ryan Rossi, Uttaran Bhattacharya, Branislav Kveton
- Abstract summary: We propose quantitative LLM judges, which align evaluation scores of existing LLM judges to humans in a given domain. The models are trained to improve the score of the original judge using its rationale and score. Our experiments show that quantitative judges can improve the predictive power of existing judges through post-hoc modeling.
- Score: 60.773734899532336
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: LLM-as-a-judge is a framework where a large language model (LLM) evaluates the output of another LLM. While LLMs excel at producing qualitative textual evaluations, they often struggle to predict human preferences and numeric scores. We propose quantitative LLM judges, which align evaluation scores of existing LLM judges to humans in a given domain using regression models. The models are trained to improve the score of the original judge using its rationale and score. We present four quantitative judges for different types of absolute and relative feedback, which showcases the generality and versatility of our framework. Our framework is more computationally efficient than supervised fine-tuning and can be more statistically efficient when human feedback is limited, which is expected in practice. We validate these claims empirically on four datasets using two base judges. Our experiments show that quantitative judges can improve the predictive power of existing judges through post-hoc modeling.
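The post-hoc alignment the abstract describes can be sketched as a minimal least-squares fit. This is an illustrative simplification, not the paper's implementation: the paper's models also condition on the judge's rationale, whereas this sketch fits only a linear map from the judge's raw score to the human scale, and all function names are hypothetical.

```python
# Hypothetical sketch: align an LLM judge's numeric scores with human
# scores via a closed-form ordinary-least-squares fit, human ~ a * judge + b.
# The actual quantitative judges also use features of the judge's rationale.

def fit_linear(judge_scores, human_scores):
    """Fit human ~ a * judge + b by ordinary least squares."""
    n = len(judge_scores)
    mean_j = sum(judge_scores) / n
    mean_h = sum(human_scores) / n
    cov = sum((j - mean_j) * (h - mean_h)
              for j, h in zip(judge_scores, human_scores))
    var = sum((j - mean_j) ** 2 for j in judge_scores)
    a = cov / var
    b = mean_h - a * mean_j
    return a, b

def align(judge_score, a, b):
    """Map a raw judge score onto the human-aligned scale."""
    return a * judge_score + b

# Toy data: this judge systematically scores about two points below humans.
judge = [1.0, 2.0, 3.0, 4.0]
human = [3.1, 3.9, 5.1, 5.9]
a, b = fit_linear(judge, human)
```

Because the regression is fit after the base judge is frozen, only the two scalar parameters are learned, which is why this kind of post-hoc modeling is far cheaper than supervised fine-tuning of the judge itself.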
Related papers
- Who Judges the Judge? LLM Jury-on-Demand: Building Trustworthy LLM Evaluation Systems [2.9141470183751674]
We propose a dynamic, learning-based framework for scalable and context-aware evaluation. Our method trains a set of reliability predictors to assess when LLM judges will agree with human experts. Experiments on summarization and RAG benchmarks show that our dynamic jury system achieves significantly higher correlation with human judgment than both single-judge and static-jury baselines.
arXiv Detail & Related papers (2025-12-01T15:26:20Z) - CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards [72.44810390478229]
CompassJudger-2 is a novel generalist judge model that overcomes limitations via a task-driven, multi-domain data curation strategy. CompassJudger-2 achieves superior results across multiple judge and reward benchmarks.
arXiv Detail & Related papers (2025-07-12T01:34:24Z) - Evaluating Scoring Bias in LLM-as-a-Judge [8.67484421243584]
Large Language Models (LLMs) are employed as evaluators for complex tasks. There are various biases within LLM-as-a-Judge, which adversely affect the fairness and reliability of judgments.
arXiv Detail & Related papers (2025-06-27T15:25:23Z) - J4R: Learning to Judge with Equivalent Initial State Group Relative Policy Optimization [69.23273504123941]
We train judges to be robust to positional biases that arise in more complex evaluation settings. We introduce ReasoningJudgeBench, a benchmark that evaluates judges in diverse reasoning settings not covered by prior work. We train Judge for Reasoning (J4R), a 7B judge trained with EIS-GRPO that outperforms GPT-4o and the next best small judge by 6.7% and 9%.
arXiv Detail & Related papers (2025-05-19T16:50:35Z) - Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators [66.83088028268318]
This paper introduces the Judge Evaluation for Test-Time Scaling benchmark. It evaluates judge performance in three domains (math reasoning, code generation, and instruction following) under three task settings. Our benchmark shows that while judges are competitive with outcome reward models in reranking, they are consistently worse than process reward models in beam search procedures.
arXiv Detail & Related papers (2025-04-21T17:33:23Z) - Judge as A Judge: Improving the Evaluation of Retrieval-Augmented Generation through the Judge-Consistency of Large Language Models [68.92020689188887]
Retrieval-Augmented Generation (RAG) has proven its effectiveness in alleviating hallucinations for Large Language Models (LLMs). Existing automated evaluation metrics cannot fairly evaluate the outputs generated by RAG models during training and evaluation. This paper introduces the Judge-Consistency (ConsJudge) method, which aims to enhance LLMs to generate more accurate evaluations for RAG models.
arXiv Detail & Related papers (2025-02-26T04:50:43Z) - JuStRank: Benchmarking LLM Judges for System Ranking [7.507819077549208]
We conduct the first large-scale study of LLM judges as system rankers. System scores are generated by aggregating judgment scores over multiple system outputs. Our analysis provides a fine-grained characterization of judge behavior, including their decisiveness and bias.
arXiv Detail & Related papers (2024-12-12T18:51:13Z) - JudgeBench: A Benchmark for Evaluating LLM-based Judges [61.048125269475854]
JudgeBench is a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding. Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks.
arXiv Detail & Related papers (2024-10-16T17:58:19Z) - From Calculation to Adjudication: Examining LLM judges on Mathematical Reasoning Tasks [11.01213914485374]
We study large language models (LLMs) on mathematical reasoning tasks. Our analysis uncovers a strong correlation between judgment performance and the candidate model task performance. As a consequence, we test whether we can predict the behavior of LLM judges using simple features such as part-of-speech tags.
arXiv Detail & Related papers (2024-09-06T10:09:41Z) - Evaluating the Evaluator: Measuring LLMs' Adherence to Task Evaluation Instructions [18.93335792080899]
We investigate how much prompting LLMs-as-a-judge influences the alignment of AI judgments with human judgments.
We aggregate a taxonomy of quality criteria commonly used across state-of-the-art evaluations with LLMs and provide this as a rigorous benchmark of models as judges.
arXiv Detail & Related papers (2024-08-16T14:49:35Z) - LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks [106.09361690937618]
There is an increasing trend towards evaluating NLP models with LLMs instead of human judgments. We provide JUDGE-BENCH, a collection of 20 NLP datasets with human annotations covering a broad range of evaluated properties and types of data. We evaluate 11 current LLMs, covering both open-weight and proprietary models, for their ability to replicate the annotations.
arXiv Detail & Related papers (2024-06-26T14:56:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.