The Viability of Crowdsourcing for RAG Evaluation
- URL: http://arxiv.org/abs/2504.15689v1
- Date: Tue, 22 Apr 2025 08:13:34 GMT
- Title: The Viability of Crowdsourcing for RAG Evaluation
- Authors: Lukas Gienapp, Tim Hagen, Maik Fröbe, Matthias Hagen, Benno Stein, Martin Potthast, Harrisen Scells
- Abstract summary: We present the Crowd RAG Corpus 2025 (CrowdRAG-25), which consists of 903 human-written and 903 LLM-generated responses for the 301 topics of the TREC RAG'24 track. Our analyses give insights into human writing behavior for RAG and the viability of crowdsourcing for RAG evaluation.
- Score: 39.275627272019925
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: How good are humans at writing and judging responses in retrieval-augmented generation (RAG) scenarios? To answer this question, we investigate the efficacy of crowdsourcing for RAG through two complementary studies: response writing and response utility judgment. We present the Crowd RAG Corpus 2025 (CrowdRAG-25), which consists of 903 human-written and 903 LLM-generated responses for the 301 topics of the TREC RAG'24 track, across the three discourse styles 'bulleted list', 'essay', and 'news'. For a selection of 65 topics, the corpus further contains 47,320 pairwise human judgments and 10,556 pairwise LLM judgments across seven utility dimensions (e.g., coverage and coherence). Our analyses give insights into human writing behavior for RAG and the viability of crowdsourcing for RAG evaluation. Human pairwise judgments provide reliable and cost-effective results compared to LLM-based pairwise or human/LLM-based pointwise judgments, as well as automated comparisons with human-written reference responses. All our data and tools are freely available.
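The abstract's headline claim is that pairwise human judgments give reliable, cost-effective rankings of RAG responses. As a minimal sketch of how such pairwise preferences can be aggregated into a per-response ranking, the Python snippet below computes simple win rates; the judgment record format and response labels are hypothetical rather than CrowdRAG-25's actual schema, and a Bradley-Terry model would be a more principled choice for larger judgment sets.
```python
# Minimal sketch: turning pairwise utility judgments (like those collected
# for CrowdRAG-25) into a per-response ranking via simple win rates.
# The (response_a, response_b, winner) record format is an assumption
# made for illustration, not the corpus's actual schema.
from collections import defaultdict

def win_rates(judgments):
    """judgments: iterable of (response_a, response_b, winner) tuples,
    where winner equals either response_a or response_b."""
    wins = defaultdict(int)
    comparisons = defaultdict(int)
    for a, b, winner in judgments:
        comparisons[a] += 1
        comparisons[b] += 1
        wins[winner] += 1
    return {r: wins[r] / comparisons[r] for r in comparisons}

# Hypothetical example: three responses compared pairwise on one topic.
judgments = [
    ("human_essay", "llm_essay", "human_essay"),
    ("human_essay", "llm_news", "llm_news"),
    ("llm_essay", "llm_news", "llm_news"),
]
ranking = sorted(win_rates(judgments).items(), key=lambda kv: -kv[1])
print(ranking)  # [('llm_news', 1.0), ('human_essay', 0.5), ('llm_essay', 0.0)]
```
Win rates are easy to bootstrap for significance testing and need no score calibration across annotators, which is one reason pairwise schemes tend to scale well with crowd workers.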
Related papers
- Support Evaluation for the TREC 2024 RAG Track: Comparing Human versus LLM Judges [53.12387628636912]
A crucial factor in RAG evaluation is "support": whether the information in the cited documents supports the answer. We conducted a large-scale comparative study of 45 participant submissions on 36 topics to the TREC 2024 RAG Track. Our results indicate that for 56% of the manual from-scratch assessments, human and GPT-4o predictions match perfectly.
arXiv Detail & Related papers (2025-04-21T16:20:43Z)
- A Human-AI Comparative Analysis of Prompt Sensitivity in LLM-Based Relevance Judgment [15.255877686845773]
Large Language Models (LLMs) are increasingly used to automate relevance judgments for information retrieval (IR) tasks.
To assess the robustness and reliability of LLM-based relevance judgments, we systematically investigate the impact of prompt sensitivity on the task.
arXiv Detail & Related papers (2025-04-16T18:17:19Z)
- HREF: Human Response-Guided Evaluation of Instruction Following in Language Models [61.273153125847166]
We develop a new evaluation benchmark, Human Response-Guided Evaluation of Instruction Following (HREF).
In addition to providing reliable evaluation, HREF emphasizes individual task performance and is free from contamination.
We study the impact of key design choices in HREF, including the size of the evaluation set, the judge model, the baseline model, and the prompt template.
arXiv Detail & Related papers (2024-12-20T03:26:47Z)
- GLIDER: Grading LLM Interactions and Decisions using Explainable Ranking [0.9614204956530676]
We introduce GLIDER, a powerful 3B evaluator LLM that can score any text input and associated context on arbitrary user-defined criteria. GLIDER shows higher Pearson's correlation than GPT-4o on FLASK and greatly outperforms prior evaluation models. It supports fine-grained scoring, multilingual reasoning, and span highlighting, and was trained on 685 domains and 183 criteria.
arXiv Detail & Related papers (2024-12-18T18:41:12Z)
- A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look [52.114284476700874]
This paper reports on the results of a large-scale evaluation (the TREC 2024 RAG Track) where four different relevance assessment approaches were deployed.
We find that automatically generated UMBRELA judgments can replace fully manual judgments to accurately capture run-level effectiveness (see the run-level agreement sketch after this list).
Surprisingly, we find that LLM assistance does not appear to increase correlation with fully manual assessments, suggesting that costs associated with human-in-the-loop processes do not bring obvious tangible benefits.
arXiv Detail & Related papers (2024-11-13T01:12:35Z)
- ELOQ: Resources for Enhancing LLM Detection of Out-of-Scope Questions [52.33835101586687]
Large Language Models (LLMs) are widely used in Conversational AI systems to generate responses to user inquiries. We propose a guided hallucination-based method to efficiently generate a diverse set of out-of-scope questions from a given document corpus.
arXiv Detail & Related papers (2024-10-18T16:11:29Z)
- RAG-QA Arena: Evaluating Domain Robustness for Long-form Retrieval Augmented Question Answering [61.19126689470398]
Long-form RobustQA (LFRQA) is a new dataset covering 26K queries and large corpora across seven different domains.
We show via experiments that RAG-QA Arena and human judgments on answer quality are highly correlated.
Only 41.3% of the most competitive LLM's answers are preferred to LFRQA's answers, demonstrating that RAG-QA Arena is a challenging evaluation platform for future research.
arXiv Detail & Related papers (2024-07-19T03:02:51Z)
- RADE: Reference-Assisted Dialogue Evaluation for Open-Domain Dialogue [37.82954848948347]
We propose the Reference-Assisted Dialogue Evaluation (RADE) approach under the multi-task learning framework.
RADE explicitly compares the reference and the candidate response to predict their overall scores.
Experiments on our three datasets and two existing benchmarks demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2023-09-15T04:47:19Z)
- Perspectives on Large Language Models for Relevance Judgment [56.935731584323996]
When asked, large language models (LLMs) claim that they can assist with relevance judgments.
It is not clear whether automated judgments can reliably be used in evaluations of retrieval systems.
arXiv Detail & Related papers (2023-04-13T13:08:38Z)
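Several entries above, the TREC 2024 RAG Track study in particular, ask whether automatic judgments preserve run-level effectiveness, i.e., whether they rank systems the same way manual judgments do. A minimal sketch of that check, assuming hypothetical per-run scores rather than any of the papers' actual data, computes Kendall's tau over the two induced rankings:
```python
# Minimal sketch: run-level agreement between two judgment sources,
# measured as Kendall's tau over the system scores each source induces.
# The run names and scores below are hypothetical illustration data.
from scipy.stats import kendalltau

manual_scores    = {"run_a": 0.41, "run_b": 0.38, "run_c": 0.52, "run_d": 0.29}
automatic_scores = {"run_a": 0.44, "run_b": 0.35, "run_c": 0.55, "run_d": 0.31}

runs = sorted(manual_scores)  # fixed order so both score lists are aligned
tau, p_value = kendalltau(
    [manual_scores[r] for r in runs],
    [automatic_scores[r] for r in runs],
)
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.3f})")
# A tau close to 1 means the automatic judgments rank systems
# almost exactly as the manual judgments do.
```
With many more runs than four, a high tau is the usual evidence that a cheaper judgment source can stand in for manual assessment at the leaderboard level.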