Judge as A Judge: Improving the Evaluation of Retrieval-Augmented Generation through the Judge-Consistency of Large Language Models
- URL: http://arxiv.org/abs/2502.18817v1
- Date: Wed, 26 Feb 2025 04:50:43 GMT
- Title: Judge as A Judge: Improving the Evaluation of Retrieval-Augmented Generation through the Judge-Consistency of Large Language Models
- Authors: Shuliang Liu, Xinze Li, Zhenghao Liu, Yukun Yan, Cheng Yang, Zheni Zeng, Zhiyuan Liu, Maosong Sun, Ge Yu
- Abstract summary: Retrieval-Augmented Generation (RAG) has proven its effectiveness in alleviating hallucinations for Large Language Models (LLMs). Existing automated evaluation metrics cannot fairly evaluate the outputs generated by RAG models during training and evaluation. This paper introduces the Judge-Consistency (ConsJudge) method, which aims to enhance LLMs to generate more accurate evaluations for RAG models.
- Score: 68.92020689188887
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Retrieval-Augmented Generation (RAG) has proven its effectiveness in alleviating hallucinations for Large Language Models (LLMs). However, existing automated evaluation metrics cannot fairly evaluate the outputs generated by RAG models during training and evaluation. LLM-based judgment models can potentially produce high-quality judgments, but they are highly sensitive to evaluation prompts, leading to inconsistencies when judging the output of RAG models. This paper introduces the Judge-Consistency (ConsJudge) method, which aims to enhance LLMs to generate more accurate evaluations for RAG models. Specifically, ConsJudge prompts LLMs to generate different judgments based on various combinations of judgment dimensions, uses judge-consistency to evaluate these judgments, and selects the accepted and rejected judgments for DPO training. Our experiments show that ConsJudge can effectively provide more accurate judgments for optimizing RAG models across various RAG models and datasets. Further analysis reveals that judgments generated by ConsJudge show high agreement with a superior LLM. All code is available at https://github.com/OpenBMB/ConsJudge.
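The judge-consistency selection described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' released implementation (see the GitHub link above for that): the dimension names, the `call_llm_judge` helper, and the simple agreement-based consistency score are assumptions made for the example.

```python
# Minimal sketch of the judge-consistency idea from the abstract (assumptions,
# not the authors' implementation). The judge is prompted under different
# combinations of judgment dimensions, each judgment is scored by how often the
# other judgments agree with it, and the most/least consistent combinations
# yield the accepted and rejected judgments used as a DPO preference pair.

from itertools import combinations
from typing import Callable, Dict, List, Sequence, Tuple

DIMENSIONS = ["hallucination", "completeness", "coherence", "relevance"]  # assumed names

def judge_consistency_pair(
    question: str,
    rag_outputs: List[str],
    call_llm_judge: Callable[[str, List[str], Sequence[str]], int],  # hypothetical helper
) -> Tuple[Sequence[str], Sequence[str]]:
    """Return the dimension combinations producing the accepted and rejected judgments."""
    # Judge the same RAG outputs under every combination of two or more dimensions.
    combos = [c for r in range(2, len(DIMENSIONS) + 1)
              for c in combinations(DIMENSIONS, r)]
    # Each judgment is the index of the preferred RAG output under that combination.
    votes: Dict[Sequence[str], int] = {
        c: call_llm_judge(question, rag_outputs, c) for c in combos
    }
    # Judge-consistency score: fraction of the *other* judgments that agree.
    consistency = {
        c: sum(v == votes[c] for c2, v in votes.items() if c2 != c) / (len(votes) - 1)
        for c in combos
    }
    accepted = max(combos, key=consistency.get)  # most consistent judgment
    rejected = min(combos, key=consistency.get)  # least consistent judgment
    return accepted, rejected
```

In the full method the judgment texts produced under these combinations, rather than the combinations themselves, would serve as the chosen/rejected pair for DPO training of the judge model.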
Related papers
- Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators [66.83088028268318]
This paper introduces the Judge Evaluation for Test-Time Scaling benchmark.
It evaluates judge performance in three domains (math reasoning, code generation, and instruction following) under three task settings.
Our benchmark shows that while judges are competitive with outcome reward models in reranking, they are consistently worse than process reward models in beam search procedures.
arXiv Detail & Related papers (2025-04-21T17:33:23Z) - Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation [57.380464382910375]
We show that the choice of feedback protocol can significantly affect evaluation reliability and induce systematic biases.
In particular, we show that pairwise evaluation protocols are more vulnerable to distracted evaluation.
arXiv Detail & Related papers (2025-04-20T19:05:59Z) - Bi'an: A Bilingual Benchmark and Model for Hallucination Detection in Retrieval-Augmented Generation [6.549143816134529]
We introduce Bi'an, a novel framework featuring a bilingual benchmark dataset and lightweight judge models. The dataset supports rigorous evaluation across multiple RAG scenarios, while the judge models are fine-tuned from compact open-source LLMs.
arXiv Detail & Related papers (2025-02-26T15:12:59Z) - JudgeRank: Leveraging Large Language Models for Reasoning-Intensive Reranking [81.88787401178378]
We introduce JudgeRank, a novel agentic reranker that emulates human cognitive processes when assessing document relevance.
We evaluate JudgeRank on the reasoning-intensive BRIGHT benchmark, demonstrating substantial performance improvements over first-stage retrieval methods.
In addition, JudgeRank performs on par with fine-tuned state-of-the-art rerankers on the popular BEIR benchmark, validating its zero-shot generalization capability.
arXiv Detail & Related papers (2024-10-31T18:43:12Z) - JudgeBench: A Benchmark for Evaluating LLM-based Judges [61.048125269475854]
JudgeBench is a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding.
Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks.
arXiv Detail & Related papers (2024-10-16T17:58:19Z) - Self-rationalization improves LLM as a fine-grained judge [21.917301609125417]
We introduce Self-Rationalization, an iterative process of improving the rationales for the judge models.
Self-rationalization works by having the model generate multiple judgments with rationales for the same input.
We show that our model learns to produce higher-quality rationales, with an average win rate of 62% compared to models trained only via SFT on rationales.
arXiv Detail & Related papers (2024-10-07T21:05:53Z) - Direct Judgement Preference Optimization [66.83088028268318]
We train large language models (LLMs) as generative judges to evaluate and critique other models' outputs.
We employ three approaches to collect the preference pairs for different use cases, each aimed at improving our generative judge from a different perspective.
Our model robustly counters inherent biases such as position and length bias, flexibly adapts to any evaluation protocol specified by practitioners, and provides helpful language feedback for improving downstream generator models.
arXiv Detail & Related papers (2024-09-23T02:08:20Z) - Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)