Fine Grained Evaluation of LLMs-as-Judges
- URL: http://arxiv.org/abs/2601.08919v1
- Date: Tue, 13 Jan 2026 19:01:16 GMT
- Title: Fine Grained Evaluation of LLMs-as-Judges
- Authors: Sourav Saha, Mandar Mitra
- Abstract summary: Large Language Models (LLMs) may be used as `judges' in place of humans. We evaluate the quality of LLMs as judges not only at the document level, but also quantify how often these `judges' are right for the right reasons.
- Score: 1.5267938856942276
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A good deal of recent research has focused on how Large Language Models (LLMs) may be used as `judges' in place of humans to evaluate the quality of the output produced by various text / image processing systems. Within this broader context, a number of studies have investigated the specific question of how effectively LLMs can be used as relevance assessors for the standard ad hoc task in Information Retrieval (IR). We extend these studies by looking at additional questions. Most importantly, we use a Wikipedia-based test collection created by the INEX initiative, and prompt LLMs not only to judge whether documents are relevant / non-relevant, but also to highlight relevant passages in documents that they regard as useful. The human relevance assessors involved in creating this collection were given analogous instructions, i.e., they were asked to highlight all passages within a document that respond to the information need expressed in a query. This enables us to evaluate the quality of LLMs as judges not only at the document level, but also to quantify how often these `judges' are right for the right reasons. Our findings suggest that LLMs-as-judges work best under human supervision.
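As a rough illustration of the passage-level analysis described above, the following sketch compares human-highlighted and LLM-highlighted character spans to estimate how often a "relevant" verdict is also right for the right reasons. The field names and the overlap threshold are assumptions for illustration, not the paper's actual procedure.

```python
# Illustrative sketch only; data layout and threshold are assumptions,
# not the paper's implementation.

def span_overlap(human_spans, llm_spans):
    """Fraction of human-highlighted characters also highlighted by the LLM."""
    human_chars = {i for s, e in human_spans for i in range(s, e)}
    llm_chars = {i for s, e in llm_spans for i in range(s, e)}
    if not human_chars:
        return 0.0
    return len(human_chars & llm_chars) / len(human_chars)

def right_for_right_reasons(judgements, overlap_threshold=0.5):
    """judgements: list of dicts with keys 'human_rel', 'llm_rel' (bool)
    and 'human_spans', 'llm_spans' (lists of (start, end) character offsets)."""
    agree, justified = 0, 0
    for j in judgements:
        if j["human_rel"] == j["llm_rel"]:
            agree += 1
            # For documents both call relevant, check whether the LLM
            # highlighted roughly the same passages as the human assessor.
            if j["human_rel"] and span_overlap(j["human_spans"], j["llm_spans"]) >= overlap_threshold:
                justified += 1
    n = len(judgements)
    return {"doc_agreement": agree / n, "right_for_right_reasons": justified / n}

example = [{"human_rel": True, "llm_rel": True,
            "human_spans": [(10, 80)], "llm_spans": [(5, 60)]}]
print(right_for_right_reasons(example))
```

Character-level overlap is only one possible proxy; sentence- or passage-level matching would serve the same purpose.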
Related papers
- LLM-Specific Utility: A New Perspective for Retrieval-Augmented Generation [110.610512800947]
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external knowledge. Existing studies often treat utility as a generic attribute, ignoring the fact that different LLMs may benefit differently from the same passage.
arXiv Detail & Related papers (2025-10-13T12:57:45Z) - AISysRev -- LLM-based Tool for Title-abstract Screening [0.7758046038799246]
AISysRev is a web application running in a Docker container for screening papers. It accepts a CSV file containing paper titles and abstracts. Users specify inclusion and exclusion criteria. It supports both zero-shot and few-shot screening.
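A minimal zero-shot screening loop in the spirit of the tool described above might look as follows; the CSV column names, prompt wording, and the `call_llm` placeholder are assumptions, not AISysRev's actual implementation.

```python
# Illustrative zero-shot title/abstract screening sketch (not AISysRev's code).
import csv

INCLUSION = "Studies that evaluate LLMs as relevance assessors."
EXCLUSION = "Position papers without any empirical evaluation."

def call_llm(prompt: str) -> str:
    # Placeholder for a real LLM API call.
    return "include"

def screen(csv_path: str):
    decisions = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):  # assumes 'title' and 'abstract' columns
            prompt = (
                f"Inclusion criteria: {INCLUSION}\n"
                f"Exclusion criteria: {EXCLUSION}\n"
                f"Title: {row['title']}\nAbstract: {row['abstract']}\n"
                "Answer with 'include' or 'exclude'."
            )
            decisions.append((row["title"], call_llm(prompt).strip().lower()))
    return decisions
```

A few-shot variant would simply prepend labelled example papers to the same prompt.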
arXiv Detail & Related papers (2025-10-08T06:59:23Z) - Rankers, Judges, and Assistants: Towards Understanding the Interplay of LLMs in Information Retrieval Evaluation [44.58099275559231]
Large language models (LLMs) are increasingly integral to information retrieval (IR), powering ranking, evaluation, and AI-assisted content creation. This paper synthesizes existing research and presents novel experiment designs that explore how LLM-based rankers and assistants influence LLM-based judges.
arXiv Detail & Related papers (2025-03-24T19:24:40Z) - LLMs can be Fooled into Labelling a Document as Relevant (best café near me; this paper is perfectly relevant) [26.996231897558324]
This work reports on experiments to study the labelling of short texts (i.e., passages) for relevance using multiple open-source and proprietary LLMs. While the overall agreement of some LLMs with human judgements is comparable to human-to-human agreement measured in previous research, LLMs are more likely to label passages as relevant compared to human judges.
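The kind of agreement analysis mentioned above can be sketched as follows (illustrative only, with made-up labels): Cohen's kappa between human and LLM relevance labels, alongside the rate at which each side labels passages relevant.

```python
# Minimal sketch of an LLM-vs-human agreement analysis (not the paper's code).

def cohens_kappa(human, llm):
    n = len(human)
    p_o = sum(h == g for h, g in zip(human, llm)) / n        # observed agreement
    p_yes = (sum(human) / n) * (sum(llm) / n)                 # chance both say relevant
    p_no = (1 - sum(human) / n) * (1 - sum(llm) / n)          # chance both say non-relevant
    p_e = p_yes + p_no
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0

human = [1, 0, 1, 0, 0, 1]   # hypothetical human labels (1 = relevant)
llm   = [1, 1, 1, 0, 1, 1]   # hypothetical LLM labels
print("kappa:", round(cohens_kappa(human, llm), 3))
print("relevant rate (human):", sum(human) / len(human))
print("relevant rate (LLM):  ", sum(llm) / len(llm))
```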
arXiv Detail & Related papers (2025-01-29T20:11:35Z) - LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods [21.601196380989542]
"LLMs-as-judges" are evaluators based on natural language responses. This paper presents a comprehensive survey of the "LLMs-as-judges" paradigm from five key perspectives. We aim to provide insights on the development and application of "LLMs-as-judges" in both research and practice.
arXiv Detail & Related papers (2024-12-07T08:07:24Z) - From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge [43.278175460454975]
Assessment and evaluation have long been critical challenges in artificial intelligence (AI) and natural language processing (NLP). Recent advancements in Large Language Models (LLMs) inspire the "LLM-as-a-judge" paradigm, in which LLMs are leveraged to perform scoring, ranking, or selection for various machine learning evaluation scenarios.
arXiv Detail & Related papers (2024-11-25T17:28:44Z) - JudgeRank: Leveraging Large Language Models for Reasoning-Intensive Reranking [81.88787401178378]
We introduce JudgeRank, a novel agentic reranker that emulates human cognitive processes when assessing document relevance.
We evaluate JudgeRank on the reasoning-intensive BRIGHT benchmark, demonstrating substantial performance improvements over first-stage retrieval methods.
In addition, JudgeRank performs on par with fine-tuned state-of-the-art rerankers on the popular BEIR benchmark, validating its zero-shot generalization capability.
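For orientation, a generic pointwise LLM-judge reranking loop is sketched below; it is not JudgeRank's agentic pipeline, and `score_with_llm` is a stand-in for an actual LLM scoring call.

```python
# Generic pointwise LLM-judge reranking sketch (not JudgeRank's method).

def score_with_llm(query: str, document: str) -> float:
    # Placeholder for an LLM call that reasons about the query and document
    # and returns a relevance score; here a trivial word-overlap stand-in.
    return float(len(set(query.lower().split()) & set(document.lower().split())))

def rerank(query: str, candidates: list[str], top_k: int = 10) -> list[str]:
    scored = [(score_with_llm(query, doc), doc) for doc in candidates]
    return [doc for _, doc in sorted(scored, key=lambda x: x[0], reverse=True)][:top_k]

print(rerank("llm relevance judging", ["a paper on llm judging", "unrelated text"]))
```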
arXiv Detail & Related papers (2024-10-31T18:43:12Z) - Evaluating the Evaluator: Measuring LLMs' Adherence to Task Evaluation Instructions [18.93335792080899]
We investigate how much influence prompting the LLMs-as-a-judge has on the alignment of AI judgements to human judgements.
We aggregate a taxonomy of quality criteria commonly used across state-of-the-art evaluations with LLMs and provide this as a rigorous benchmark of models as judges.
arXiv Detail & Related papers (2024-08-16T14:49:35Z) - LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing [106.45895712717612]
Large language models (LLMs) have shown remarkable versatility in various generative tasks.
This study focuses on the topic of LLMs assist NLP Researchers.
To our knowledge, this is the first work to provide such a comprehensive analysis.
arXiv Detail & Related papers (2024-06-24T01:30:22Z) - DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators. The question of how reliable these evaluators are has emerged as a crucial research question. We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
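A hedged sketch of a decompose-then-aggregate evaluation follows; the criteria, weights, and `judge_criterion` placeholder are assumptions rather than the paper's actual pipeline.

```python
# Illustrative decompose-then-aggregate evaluation sketch.

CRITERIA = {"factuality": 0.4, "relevance": 0.4, "fluency": 0.2}  # assumed weights

def judge_criterion(criterion: str, question: str, answer: str) -> float:
    # Placeholder for an LLM prompt that scores one criterion on a 0-1 scale.
    return 0.8

def decompose_and_aggregate(question: str, answer: str) -> float:
    scores = {c: judge_criterion(c, question, answer) for c in CRITERIA}
    return sum(CRITERIA[c] * s for c, s in scores.items())

print(decompose_and_aggregate("What is BM25?", "BM25 is a ranking function..."))
```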
arXiv Detail & Related papers (2024-05-24T08:12:30Z) - Investigating the Factual Knowledge Boundary of Large Language Models with Retrieval Augmentation [109.8527403904657]
We show that large language models (LLMs) possess unwavering confidence in their knowledge and cannot handle the conflict between internal and external knowledge well.
Retrieval augmentation proves to be an effective approach in enhancing LLMs' awareness of knowledge boundaries.
We propose a simple method to dynamically utilize supporting documents with our judgement strategy.
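One way such a judgement strategy could be sketched (purely illustrative; `ask_llm` is a placeholder and the prompts are assumptions, not the paper's exact method) is to let the model judge the usefulness of each retrieved document before including it as context.

```python
# Illustrative sketch of dynamically deciding whether to use retrieved documents.

def ask_llm(prompt: str) -> str:
    # Placeholder for a real LLM call.
    return "yes"

def answer(question: str, retrieved_docs: list[str]) -> str:
    useful = [d for d in retrieved_docs
              if ask_llm(f"Is this document useful for answering '{question}'?\n{d}").startswith("yes")]
    context = "\n".join(useful) if useful else ""
    return ask_llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
```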
arXiv Detail & Related papers (2023-07-20T16:46:10Z) - Perspectives on Large Language Models for Relevance Judgment [56.935731584323996]
When asked, large language models (LLMs) claim that they can assist with relevance judgments.
It is not clear whether automated judgments can reliably be used in evaluations of retrieval systems.
arXiv Detail & Related papers (2023-04-13T13:08:38Z)