Related papers: Validating LLM-Generated Relevance Labels for Educational Resource Search

URL: http://arxiv.org/abs/2504.12732v1
Date: Thu, 17 Apr 2025 08:14:45 GMT
Title: Validating LLM-Generated Relevance Labels for Educational Resource Search
Authors: Ratan J. Sebastian, Anett Hoppe
Abstract summary: We release a dataset of 401 human relevance judgements from a user study involving teaching professionals performing search tasks related to lesson planning. Using domain-specific frameworks, LLMs achieved strong agreement with human judgements. System-level evaluation showed that LLM judgements reliably identified top-performing retrieval approaches.
Score: 2.2175950967382487
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Manual relevance judgements in Information Retrieval are costly and require expertise, driving interest in using Large Language Models (LLMs) for automatic assessment. While LLMs have shown promise in general web search scenarios, their effectiveness for evaluating domain-specific search results, such as educational resources, remains unexplored. To investigate different ways of including domain-specific criteria in LLM prompts for relevance judgement, we collected and released a dataset of 401 human relevance judgements from a user study involving teaching professionals performing search tasks related to lesson planning. We compared three approaches to structuring these prompts: a simple two-aspect evaluation baseline from prior work on using LLMs as relevance judges, a comprehensive 12-dimensional rubric derived from educational literature, and criteria directly informed by the study participants. Using domain-specific frameworks, LLMs achieved strong agreement with human judgements (Cohen's $\kappa$ up to 0.650), significantly outperforming the baseline approach. The participant-derived framework proved particularly robust, with GPT-3.5 achieving $\kappa$ scores of 0.639 and 0.613 for 10-dimension and 5-dimension versions respectively. System-level evaluation showed that LLM judgements reliably identified top-performing retrieval approaches (RBO scores 0.71-0.76) while maintaining reasonable discrimination between systems (RBO 0.52-0.56). These findings suggest that LLMs can effectively evaluate educational resources when prompted with domain-specific criteria, though performance varies with framework complexity and input structure.
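The abstract reports agreement as Cohen's $\kappa$ and ranking similarity as RBO (rank-biased overlap). As a rough illustration of what these two measures compute, here is a minimal Python sketch. It is not the authors' evaluation code: the function names, the toy labels, and the persistence parameter `p = 0.9` are assumptions, and the RBO shown is the plain truncated sum over a fixed depth, which may differ from the variant used in the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's
    marginal label frequencies.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = set(count_a) | set(count_b)
    p_e = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined here, so we return 1.0 by convention (an assumption).
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def rbo_truncated(ranking_s, ranking_t, p=0.9):
    """Truncated rank-biased overlap of two rankings without duplicates.

    Computes (1 - p) * sum_{d=1..k} p^(d-1) * |S[:d] & T[:d]| / d,
    with k the length of the shorter ranking. No extrapolation is applied,
    so two identical rankings of depth k score 1 - p^k rather than 1.
    """
    k = min(len(ranking_s), len(ranking_t))
    seen_s, seen_t = set(), set()
    total = 0.0
    for d in range(1, k + 1):
        seen_s.add(ranking_s[d - 1])
        seen_t.add(ranking_t[d - 1])
        overlap = len(seen_s & seen_t)
        total += (p ** (d - 1)) * overlap / d
    return (1 - p) * total


# Toy data, made up for illustration only (not taken from the released dataset).
human_labels = [1, 1, 0, 0]
llm_labels = [1, 0, 0, 0]
kappa = cohens_kappa(human_labels, llm_labels)

# Hypothetical system names ranked under human and LLM judgements.
human_ranking = ["sysA", "sysB", "sysC"]
llm_ranking = ["sysA", "sysC", "sysB"]
rbo = rbo_truncated(human_ranking, llm_ranking, p=0.9)
```

A smaller `p` makes RBO weight the top of the ranking more heavily, which suits a setting where identifying the best-performing retrieval approaches matters more than the ordering of the tail.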

Related papers

Query-Document Dense Vectors for LLM Relevance Judgment Bias Analysis [4.719505127252616]
Large Language Models (LLMs) have been used as relevance assessors for Information Retrieval (IR) evaluation collection creation. We aim to understand if LLMs make systematic mistakes when judging relevance, rather than just understanding how good they are on average. We introduce a clustering-based framework that embeds query-document (Q-D) pairs into a joint semantic space.
arXiv Detail & Related papers (2026-01-05T03:02:33Z)
On Evaluating LLM Alignment by Evaluating LLMs as Judges [68.15541137648721]
Evaluating large language models' (LLMs) alignment requires them to be helpful, honest, safe, and to precisely follow human instructions. We examine the relationship between LLMs' generation and evaluation capabilities in aligning with human preferences. We propose a benchmark that assesses alignment without directly evaluating model outputs.
arXiv Detail & Related papers (2025-11-25T18:33:24Z)
Criteria-Based LLM Relevance Judgments [5.478764356647438]
Large Language Models (LLMs) provide a scalable solution by generating relevance labels directly through prompting. We propose the Multi-Criteria framework for LLM-based relevance judgments, decomposing the notion of relevance into multiple criteria. Our results demonstrate that Multi-Criteria judgments enhance the system ranking/leaderboard performance.
arXiv Detail & Related papers (2025-07-13T04:21:21Z)
OpenUnlearning: Accelerating LLM Unlearning via Unified Benchmarking of Methods and Metrics [82.0813150432867]
We introduce OpenUnlearning, a standardized framework for benchmarking large language model (LLM) unlearning methods and metrics. OpenUnlearning integrates 13 unlearning algorithms and 16 diverse evaluations across 3 leading benchmarks. We also benchmark diverse unlearning methods and provide a comparative analysis against an extensive evaluation suite.
arXiv Detail & Related papers (2025-06-14T20:16:37Z)
LLM-Driven Usefulness Judgment for Web Search Evaluation [12.10711284043516]
Evaluation is fundamental in optimizing search experiences and supporting diverse user intents in Information Retrieval (IR). Traditional search evaluation methods primarily rely on relevance labels, which assess how well retrieved documents match a user's query. In this paper, we explore an alternative approach: LLM-generated usefulness labels, which incorporate both implicit and explicit user behavior signals to evaluate document usefulness.
arXiv Detail & Related papers (2025-04-19T20:38:09Z)
Benchmarking LLM-based Relevance Judgment Methods [15.255877686845773]
Large Language Models (LLMs) are increasingly deployed in both academic and industry settings. We systematically compare multiple LLM-based relevance assessment methods, including binary relevance judgments, graded relevance assessments, pairwise preference-based methods, and two nugget-based evaluation methods. As part of our data release, we include relevance judgments generated by both an open-source (Llama3.2b) and a commercial (gpt-4o) model.
arXiv Detail & Related papers (2025-04-17T01:13:21Z)
Multi-Agent LLM Judge: automatic personalized LLM judge design for evaluating natural language generation applications [0.0]
Large Language Models (LLMs) have demonstrated impressive performance across diverse domains, yet they still encounter challenges such as insufficient domain-specific knowledge, biases, and hallucinations. Traditional evaluation methods, which rely on word overlap or text embeddings, are inadequate for capturing the nuanced semantic information necessary to evaluate dynamic, open-ended text generation. We propose a novel dynamic multi-agent system that automatically designs personalized LLM judges for various natural language generation applications.
arXiv Detail & Related papers (2025-04-01T09:36:56Z)
Latent Factor Models Meets Instructions: Goal-conditioned Latent Factor Discovery without Task Supervision [50.45597801390757]
Instruct-LF is a goal-oriented latent factor discovery system. It integrates instruction-following ability with statistical models to handle noisy datasets.
arXiv Detail & Related papers (2025-02-21T02:03:08Z)
Value Compass Leaderboard: A Platform for Fundamental and Validated Evaluation of LLMs Values [76.70893269183684]
As Large Language Models (LLMs) achieve remarkable breakthroughs, aligning their values with humans has become imperative. Existing evaluations focus narrowly on safety risks such as bias and toxicity. Existing benchmarks are prone to data contamination. The pluralistic nature of human values across individuals and cultures is largely ignored in measuring LLMs' value alignment.
arXiv Detail & Related papers (2025-01-13T05:53:56Z)
A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look [52.114284476700874]
This paper reports on the results of a large-scale evaluation (the TREC 2024 RAG Track) where four different relevance assessment approaches were deployed. We find that automatically generated UMBRELA judgments can replace fully manual judgments to accurately capture run-level effectiveness. Surprisingly, we find that LLM assistance does not appear to increase correlation with fully manual assessments, suggesting that costs associated with human-in-the-loop processes do not bring obvious tangible benefits.
arXiv Detail & Related papers (2024-11-13T01:12:35Z)
Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators [48.54465599914978]
Large Language Models (LLMs) have demonstrated promising capabilities as automatic evaluators in assessing the quality of generated natural language. LLMs still exhibit biases in evaluation and often struggle to generate coherent evaluations that align with human assessments. We introduce Pairwise-preference Search (PAIRS), an uncertainty-guided search-based rank aggregation method that employs LLMs to conduct pairwise comparisons locally and efficiently ranks candidate texts globally.
arXiv Detail & Related papers (2024-03-25T17:11:28Z)
Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization [132.25202059478065]
We benchmark large language models (LLMs) on instruction controllable text summarization. Our study reveals that instruction controllable text summarization remains a challenging task for LLMs.
arXiv Detail & Related papers (2023-11-15T18:25:26Z)
Which is better? Exploring Prompting Strategy For LLM-based Metrics [6.681126871165601]
This paper describes the DSBA submissions to the Prompting Large Language Models as Explainable Metrics shared task. Traditional similarity-based metrics such as BLEU and ROUGE have been shown to misalign with human evaluation and are ill-suited for open-ended generation tasks.
arXiv Detail & Related papers (2023-11-07T06:36:39Z)
L-Eval: Instituting Standardized Evaluation for Long Context Language Models [91.05820785008527]
We propose L-Eval to institute a more standardized evaluation for long context language models (LCLMs). We build a new evaluation suite containing 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs. Results show that popular n-gram matching metrics generally cannot correlate well with human judgment.
arXiv Detail & Related papers (2023-07-20T17:59:41Z)

This list is automatically generated from the titles and abstracts of the papers on this site.