JudgeLM: Fine-tuned Large Language Models are Scalable Judges
- URL: http://arxiv.org/abs/2310.17631v1
- Date: Thu, 26 Oct 2023 17:48:58 GMT
- Title: JudgeLM: Fine-tuned Large Language Models are Scalable Judges
- Authors: Lianghui Zhu, Xinggang Wang, Xinlong Wang
- Abstract summary: We propose to fine-tune Large Language Models (LLMs) as scalable judges (JudgeLM) to evaluate LLMs efficiently and effectively in open-ended benchmarks.
We first propose a comprehensive, large-scale, high-quality dataset containing task seeds, LLM-generated answers, and GPT-4-generated judgments for fine-tuning high-performance judges.
We then analyze the key biases in fine-tuning LLMs as judges and identify them as position bias, knowledge bias, and format bias.
- Score: 54.007823006976516
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating Large Language Models (LLMs) in open-ended scenarios is
challenging because existing benchmarks and metrics cannot measure them
comprehensively. To address this problem, we propose to fine-tune LLMs as
scalable judges (JudgeLM) to evaluate LLMs efficiently and effectively in
open-ended benchmarks. We first propose a comprehensive, large-scale,
high-quality dataset containing task seeds, LLM-generated answers, and
GPT-4-generated judgments for fine-tuning high-performance judges, as well as a
new benchmark for evaluating the judges. We train JudgeLM at 7B, 13B, and 33B
parameters and conduct a systematic analysis of its capabilities and behaviors.
We then analyze the key biases in fine-tuning LLMs as judges and identify them
as position bias, knowledge bias, and format bias. To address these issues,
JudgeLM introduces a bag of techniques including swap augmentation, reference
support, and reference drop, which clearly enhance the judge's performance.
JudgeLM obtains state-of-the-art judge performance on both the existing PandaLM
benchmark and our proposed new benchmark. JudgeLM is efficient: JudgeLM-7B
needs only 3 minutes to judge 5K samples on 8 A100 GPUs. JudgeLM achieves high
agreement with the teacher judge, exceeding 90% and even surpassing
human-to-human agreement. JudgeLM also demonstrates extended capabilities in
judging single answers, multimodal models, multiple answers, and multi-turn
chat.
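The abstract names swap augmentation, reference support, and reference drop but gives no implementation detail. As a rough, hedged illustration only, the sketch below shows one plausible way swap augmentation and reference drop could be realized as training-data transforms; every field name (`question`, `answer_a`, `score_a`, `reference`) and the `drop_prob` parameter are assumptions of this sketch, not specifics from the paper.

```python
import random

def make_training_views(sample, drop_prob=0.5):
    """Hypothetical sketch of swap augmentation and reference drop.

    Swap augmentation: emit a second view of each sample with the two
    candidate answers (and their teacher scores) exchanged, so a judge
    fine-tuned on this data cannot tie quality to answer position.
    Reference drop: randomly omit the reference answer so the judge
    learns to work both with and without reference support.
    """
    views = []
    for ans_1, ans_2, score_1, score_2 in [
        (sample["answer_a"], sample["answer_b"],
         sample["score_a"], sample["score_b"]),
        # Swapped view: answers and scores exchanged together.
        (sample["answer_b"], sample["answer_a"],
         sample["score_b"], sample["score_a"]),
    ]:
        reference = None if random.random() < drop_prob else sample.get("reference")
        views.append({
            "question": sample["question"],
            "answer_a": ans_1,
            "answer_b": ans_2,
            "reference": reference,  # None -> prompt omits the reference block
            "target_scores": (score_1, score_2),
        })
    return views
```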
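The abstract's above-90% figure is an agreement rate with the teacher judge. A minimal sketch of how such a rate is conventionally computed (exact match over per-sample verdicts; the 'A'/'B'/'tie' label set is an assumption, since the listing does not specify the verdict format):

```python
def agreement(judge_verdicts, teacher_verdicts):
    """Fraction of samples where the fine-tuned judge's verdict matches
    the teacher judge's verdict on the same answer pair."""
    assert len(judge_verdicts) == len(teacher_verdicts)
    matches = sum(j == t for j, t in zip(judge_verdicts, teacher_verdicts))
    return matches / len(judge_verdicts)

# Toy usage with hypothetical 'A'/'B'/'tie' verdicts:
print(agreement(["A", "B", "tie", "A"], ["A", "B", "A", "A"]))  # -> 0.75
```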
Related papers
- MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation? [59.7772329962047]
We introduce MJ-Bench, a novel benchmark which incorporates a comprehensive preference dataset to evaluate multimodal judges.
Specifically, we evaluate a large variety of multimodal judges, including smaller CLIP-based scoring models, open-source VLMs, and closed-source VLMs.
Experiments reveal that closed-source VLMs generally provide better feedback, with GPT-4o outperforming the other judges on average.
arXiv Detail & Related papers (2024-07-05T20:03:16Z)
- Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges [6.609843448260634]
We study the performance of various large language models (LLMs) acting as judges.
We leverage TriviaQA as a benchmark for assessing objective knowledge reasoning of LLMs.
We find that both Llama-3 70B and GPT-4 Turbo align well with humans, but in ranking exam-taker models they are outperformed by both JudgeLM-7B and the lexical judge Contains.
arXiv Detail & Related papers (2024-06-18T13:49:54Z)
- Language Model Council: Benchmarking Foundation Models on Highly Subjective Tasks by Consensus [3.8436076642278754]
Leaderboards like Arena rank Large Language Models (LLMs) based on how well their responses align with human preferences.
We propose a novel benchmarking framework, the Language Model Council (LMC).
The LMC operates through a democratic process to: 1) formulate a test set through equal participation, 2) administer the test among council members, and 3) evaluate responses as a collective jury.
arXiv Detail & Related papers (2024-06-12T19:05:43Z)
- Auto Arena of LLMs: Automating LLM Evaluations with Agent Peer-battles and Committee Discussions [77.83767077859835]
We propose the Auto-Arena of LLMs, which automates the entire evaluation process with LLM agents.
In our experiment on the 17 newest LLMs, Auto-Arena shows the highest correlation with human preferences.
arXiv Detail & Related papers (2024-05-30T17:19:19Z)
- Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models [56.02275285521847]
We propose to evaluate models using a Panel of LLM evaluators (PoLL).
We find that using a PoLL composed of a larger number of smaller models outperforms a single large judge, exhibits less intra-model bias due to its composition of disjoint model families, and does so while being over seven times less expensive.
arXiv Detail & Related papers (2024-04-29T15:33:23Z)
- Humans or LLMs as the Judge? A Study on Judgement Biases [17.069314000437537]
We propose a novel framework, free of ground-truth annotations, for investigating Fallacy Oversight Bias, Authority Bias, and Beauty Bias in LLM and human judges.
Results show that human and LLM judges are vulnerable to perturbations to various degrees, and that even the cutting-edge judges possess considerable biases.
We hope our work alerts the community to the vulnerability of both human and LLM judges to perturbations, and to the urgency of developing robust evaluation systems.
arXiv Detail & Related papers (2024-02-16T13:21:06Z)
- A Comprehensive Evaluation of Large Language Models on Legal Judgment Prediction [60.70089334782383]
Large language models (LLMs) have demonstrated great potential for domain-specific applications.
Recent disputes over GPT-4's law evaluation raise questions about LLMs' performance in real-world legal tasks.
We design practical baseline solutions based on LLMs and test them on the task of legal judgment prediction.
arXiv Detail & Related papers (2023-10-18T07:38:04Z)
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena [76.21004582932268]
We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases.
We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench, a multi-turn question set, and Chatbot Arena, a crowdsourced battle platform.
arXiv Detail & Related papers (2023-06-09T05:55:52Z)