Topic-Specific Classifiers are Better Relevance Judges than Prompted LLMs
- URL: http://arxiv.org/abs/2510.04633v1
- Date: Mon, 06 Oct 2025 09:38:13 GMT
- Title: Topic-Specific Classifiers are Better Relevance Judges than Prompted LLMs
- Authors: Lukas Gienapp, Martin Potthast, Harrisen Scells, Eugene Yang
- Abstract summary: The unjudged document problem is a key obstacle to the reusability of test collections in information retrieval. We train topic-specific relevance classifiers by finetuning monoT5 with independent LoRA weight adaptation on the judgments of a single assessor. As little as 128 initial human judgments per topic suffice to improve the comparability of models.
- Score: 34.14678608130442
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The unjudged document problem, where pooled test collections have incomplete relevance judgments for evaluating new retrieval systems, is a key obstacle to the reusability of test collections in information retrieval. While the de facto standard to deal with the problem is to treat unjudged documents as non-relevant, many alternatives have been proposed, including the use of large language models (LLMs) as a relevance judge (LLM-as-a-judge). However, this has been criticized as circular, since the same LLM can be used as a judge and as a ranker at the same time. We propose to train topic-specific relevance classifiers instead: By finetuning monoT5 with independent LoRA weight adaptation on the judgments of a single assessor for a single topic's pool, we align it to that assessor's notion of relevance for the topic. The system rankings obtained through our classifier's relevance judgments achieve a Spearman's $\rho$ correlation of $>0.95$ with ground truth system rankings. As little as 128 initial human judgments per topic suffice to improve the comparability of models, compared to treating unjudged documents as non-relevant, while achieving more reliability than existing LLM-as-a-judge approaches. Topic-specific relevance classifiers thus are a lightweight and straightforward way to tackle the unjudged document problem, while maintaining human judgments as the gold standard for retrieval evaluation. Code, models, and data are made openly available.
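To make the approach in the abstract concrete, the following is a minimal sketch (not the authors' released code) of how a topic-specific relevance classifier could be set up: a monoT5 backbone with an independent LoRA adapter per topic, fine-tuned on that topic's human judgments and then used to label the topic's unjudged pool. The checkpoint name, hyperparameters, and the `judgments` structure are illustrative assumptions.

```python
# Minimal sketch, not the authors' released code: one LoRA adapter per topic on
# top of a monoT5 backbone, trained on that topic's human judgments.
# Checkpoint name, hyperparameters, and the `judgments` format are assumptions.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer
from peft import LoraConfig, TaskType, get_peft_model

MODEL_NAME = "castorini/monot5-base-msmarco"  # assumed monoT5 checkpoint
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)


def encode(query, document, label=None):
    """monoT5 input format; the target token is 'true' (relevant) or 'false'."""
    inputs = tokenizer(
        f"Query: {query} Document: {document} Relevant:",
        truncation=True, max_length=512, return_tensors="pt",
    )
    if label is not None:
        inputs["labels"] = tokenizer(
            "true" if label else "false", return_tensors="pt"
        ).input_ids
    return inputs


def train_topic_classifier(judgments, epochs=3, lr=1e-4):
    """judgments: list of (query, document, is_relevant) triples for one topic,
    e.g. the ~128 initial human judgments mentioned in the abstract."""
    base = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)  # fresh copy per topic
    lora = LoraConfig(
        task_type=TaskType.SEQ_2_SEQ_LM,
        r=8, lora_alpha=16, lora_dropout=0.1,
        target_modules=["q", "v"],  # T5 attention projections
    )
    model = get_peft_model(base, lora)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for query, document, is_relevant in judgments:
            loss = model(**encode(query, document, is_relevant)).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model


@torch.no_grad()
def judge(model, query, document):
    """Label one unjudged document for this topic by comparing the logits of
    the 'true' and 'false' tokens at the first decoding step."""
    model.eval()
    out = model.generate(**encode(query, document), max_new_tokens=1,
                         output_scores=True, return_dict_in_generate=True)
    scores = out.scores[0][0]
    true_id = tokenizer("true").input_ids[0]
    false_id = tokenizer("false").input_ids[0]
    return bool(scores[true_id] > scores[false_id])
```

In this setup the adapters for different topics stay independent, so each classifier only has to mirror one assessor's notion of relevance for one pool; the resulting labels can then be plugged into standard evaluation (e.g. nDCG per system, followed by rank correlation with the ground-truth system ranking).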
Related papers
- Query-Document Dense Vectors for LLM Relevance Judgment Bias Analysis [4.719505127252616]
Large Language Models (LLMs) have been used as relevance assessors for Information Retrieval (IR) evaluation collection creation. We aim to understand if LLMs make systematic mistakes when judging relevance, rather than just understanding how good they are on average. We introduce a clustering-based framework that embeds query-document (Q-D) pairs into a joint semantic space.
arXiv Detail & Related papers (2026-01-05T03:02:33Z) - Do LLM-judges Align with Human Relevance in Cranfield-style Recommender Evaluation? [40.49875426230813]
This paper investigates whether Large Language Models (LLMs) can serve as reliable automatic judges to address scalability challenges. Using the ML-32M-ext Cranfield-style movie recommendation collection, we first examine the limitations of existing evaluation methodologies. We find that incorporating richer item metadata and longer user histories improves alignment, and that the LLM judge yields high agreement with human-based rankings.
arXiv Detail & Related papers (2025-11-28T16:10:39Z) - Variations in Relevance Judgments and the Shelf Life of Test Collections [50.060833338921945]
We reproduce prior work in the neural retrieval setting, showing that assessor disagreement does not affect system rankings. We observe that some models substantially degrade with our new relevance judgments, and some have already reached the effectiveness of humans as rankers.
arXiv Detail & Related papers (2025-02-28T10:46:56Z) - Tuning LLM Judge Design Decisions for 1/1000 of the Cost [42.06346155380305]
Evaluating Large Language Models (LLMs) often requires costly human annotations. To address this, LLM-based judges have been proposed, which compare the outputs of two LLMs. While several approaches have been proposed, many confounding factors are present across different papers.
arXiv Detail & Related papers (2025-01-24T17:01:14Z) - JudgeBlender: Ensembling Judgments for Automatic Relevance Assessment [28.4353755578306]
Large Language Models (LLMs) have shown promise in generating relevance labels for search tasks. We introduce JudgeBlender, a framework that employs smaller, open-source models to provide relevance judgments.
arXiv Detail & Related papers (2024-12-17T19:04:15Z) - JudgeRank: Leveraging Large Language Models for Reasoning-Intensive Reranking [81.88787401178378]
We introduce JudgeRank, a novel agentic reranker that emulates human cognitive processes when assessing document relevance.
We evaluate JudgeRank on the reasoning-intensive BRIGHT benchmark, demonstrating substantial performance improvements over first-stage retrieval methods.
In addition, JudgeRank performs on par with fine-tuned state-of-the-art rerankers on the popular BEIR benchmark, validating its zero-shot generalization capability.
arXiv Detail & Related papers (2024-10-31T18:43:12Z) - JudgeBench: A Benchmark for Evaluating LLM-based Judges [61.048125269475854]
JudgeBench is a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding. Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks.
arXiv Detail & Related papers (2024-10-16T17:58:19Z) - LLMs Can Patch Up Missing Relevance Judgments in Evaluation [56.51461892988846]
We use large language models (LLMs) to automatically label unjudged documents.
We simulate scenarios with varying degrees of holes by randomly dropping relevant documents from the relevance judgments in TREC DL tracks.
Our method achieves a Kendall's $\tau$ correlation of 0.87 and 0.92 on average for Vicuna-7B and GPT-3.5 Turbo, respectively (a minimal sketch of this kind of rank-correlation check appears after this list).
arXiv Detail & Related papers (2024-05-08T00:32:19Z) - Using Natural Language Explanations to Rescale Human Judgments [81.66697572357477]
We propose a method to rescale ordinal annotations and explanations using large language models (LLMs). We feed annotators' Likert ratings and corresponding explanations into an LLM and prompt it to produce a numeric score anchored in a scoring rubric. Our method rescales the raw judgments without impacting agreement and brings the scores closer to human judgments grounded in the same scoring rubric.
arXiv Detail & Related papers (2023-05-24T06:19:14Z) - Perspectives on Large Language Models for Relevance Judgment [56.935731584323996]
It has been claimed that large language models (LLMs) can assist with relevance judgments.
However, it is not clear whether automated judgments can reliably be used in evaluations of retrieval systems.
arXiv Detail & Related papers (2023-04-13T13:08:38Z)
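Both the abstract above and several of the related entries (e.g. the Kendall's $\tau$ figures reported for patching missing judgments) validate automatic labels the same way: rank the systems once with full human judgments and once with the automatically completed judgments, then correlate the two rankings. Below is a tiny self-contained illustration of that check with made-up effectiveness scores; none of the numbers come from the papers listed here.

```python
# Illustrative rank-correlation check with invented nDCG values; no data in this
# snippet comes from the papers listed above.
from scipy.stats import kendalltau, spearmanr

human_scores   = {"sysA": 0.62, "sysB": 0.58, "sysC": 0.55, "sysD": 0.41}  # full human judgments
patched_scores = {"sysA": 0.60, "sysB": 0.61, "sysC": 0.52, "sysD": 0.40}  # automatically patched judgments

systems = sorted(human_scores)
human = [human_scores[s] for s in systems]
patched = [patched_scores[s] for s in systems]

tau, _ = kendalltau(human, patched)   # agreement of pairwise system orderings
rho, _ = spearmanr(human, patched)    # agreement of rank positions
print(f"Kendall's tau = {tau:.2f}, Spearman's rho = {rho:.2f}")
```

Values close to 1 indicate that the automatic judgments preserve the system ranking; the $>0.95$ Spearman's $\rho$ reported in the abstract above refers to exactly this kind of comparison against ground-truth rankings.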
This list is automatically generated from the titles and abstracts of the papers on this site.