Human-Centered Evaluation of RAG outputs: a framework and questionnaire for human-AI collaboration
- URL: http://arxiv.org/abs/2509.26205v1
- Date: Tue, 30 Sep 2025 13:08:33 GMT
- Title: Human-Centered Evaluation of RAG outputs: a framework and questionnaire for human-AI collaboration
- Authors: Aline Mangold, Kiran Hoffmann
- Abstract summary: We designed a human-centred questionnaire that assesses RAG outputs across 12 dimensions. We incorporated feedback from both a human rater and a human-LLM pair. Results indicate that while large language models (LLMs) reliably focus on metric descriptions and scale labels, they exhibit weaknesses in detecting textual format variations.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Retrieval-augmented generation (RAG) systems are increasingly deployed in user-facing applications, yet systematic, human-centered evaluation of their outputs remains underexplored. Building on Gienapp's utility-dimension framework, we designed a human-centred questionnaire that assesses RAG outputs across 12 dimensions. We iteratively refined the questionnaire through several rounds of ratings on a set of query-output pairs and semantic discussions. Ultimately, we incorporated feedback from both a human rater and a human-LLM pair. Results indicate that while large language models (LLMs) reliably focus on metric descriptions and scale labels, they exhibit weaknesses in detecting textual format variations. Humans struggled to focus strictly on metric descriptions and labels. LLM ratings and explanations were viewed as a helpful support, but numeric LLM and human ratings lacked agreement. The final questionnaire extends the initial framework by focusing on user intent, text structuring, and information verifiability.
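To make the rating setup concrete, below is a minimal sketch of how per-dimension questionnaire scores from a human rater and an LLM rater could be compared. The dimension names and the 1-5 Likert scale are placeholders for illustration, not the paper's actual 12 dimensions or scoring protocol.

```python
# Hypothetical sketch: compare a human rater and an LLM rater on questionnaire
# ratings for one RAG query-output pair. Dimension names are placeholders.

human_ratings = {
    "relevance_to_user_intent": 4,
    "text_structuring": 3,
    "information_verifiability": 2,
}
llm_ratings = {
    "relevance_to_user_intent": 4,
    "text_structuring": 4,
    "information_verifiability": 3,
}

def agreement_summary(human: dict, llm: dict) -> dict:
    """Exact-match rate and mean absolute difference over shared dimensions."""
    dims = sorted(set(human) & set(llm))
    exact = sum(human[d] == llm[d] for d in dims) / len(dims)
    mad = sum(abs(human[d] - llm[d]) for d in dims) / len(dims)
    return {"exact_agreement": exact, "mean_abs_diff": mad}

print(agreement_summary(human_ratings, llm_ratings))
# e.g. {'exact_agreement': 0.33..., 'mean_abs_diff': 0.66...}
```

Numeric disagreement of this kind is what the abstract refers to when it notes that LLM and human ratings lacked agreement even though the LLM explanations were seen as helpful support.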
Related papers
- Exploring the features used for summary evaluation by Human and GPT [13.525340904948829]
We study features aligned with human and Generative Pre-trained Transformer (GPT) responses using statistical and machine-learning metrics. We show that instructing GPTs to employ the metrics used by humans can improve their judgments and align them better with human responses.
arXiv Detail & Related papers (2025-12-22T17:54:49Z)
- Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models [118.44328586173556]
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks. Human-MME is a curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric scene understanding. Our benchmark extends single-target understanding to multi-person and multi-image mutual understanding.
arXiv Detail & Related papers (2025-09-30T12:20:57Z)
- OpinioRAG: Towards Generating User-Centric Opinion Highlights from Large-scale Online Reviews [12.338320566839483]
We study the problem of opinion highlights generation from large volumes of user reviews. Existing methods either fail to scale or produce generic, one-size-fits-all summaries that overlook personalized needs. We introduce OpinioRAG, a scalable, training-free framework that combines RAG-based evidence retrieval with LLMs to efficiently produce tailored summaries.
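As a rough illustration of the retrieve-then-summarize pattern this abstract describes (not the authors' implementation), the sketch below ranks reviews by naive term overlap with the user's query and builds a prompt for an LLM; the scoring function and prompt wording are assumptions.

```python
# Illustrative, training-free retrieve-then-summarize loop for opinion highlights.
# The overlap-based retriever and prompt template are invented for this sketch.

def retrieve_reviews(query: str, reviews: list[str], k: int = 5) -> list[str]:
    """Rank reviews by naive term overlap with the user's query."""
    q_terms = set(query.lower().split())
    ranked = sorted(reviews,
                    key=lambda r: len(q_terms & set(r.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(query: str, evidence: list[str]) -> str:
    """Prompt an LLM to write a tailored highlight grounded in retrieved reviews."""
    bullets = "\n".join(f"- {r}" for r in evidence)
    return (
        f"User need: {query}\n"
        f"Relevant review excerpts:\n{bullets}\n"
        "Write a short opinion highlight that answers the user need, "
        "citing only the excerpts above."
    )

reviews = ["Battery easily lasts two days.", "Screen scratches quickly.", "Charger feels flimsy."]
print(build_prompt("How is the battery life?", retrieve_reviews("battery life", reviews, k=2)))
```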
arXiv Detail & Related papers (2025-08-30T00:00:34Z)
- UniSumEval: Towards Unified, Fine-Grained, Multi-Dimensional Summarization Evaluation for LLMs [19.097842830790405]
Existing benchmarks for summarization quality evaluation often lack diverse input scenarios and focus on narrowly defined dimensions.
We create the UniSumEval benchmark, which extends the range of input contexts and provides fine-grained, multi-dimensional annotations.
arXiv Detail & Related papers (2024-09-30T02:56:35Z)
- Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation [65.16137964758612]
We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books.
Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text.
arXiv Detail & Related papers (2024-05-31T20:15:10Z)
- PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models [72.57329554067195]
ProxyQA is an innovative framework dedicated to assessing long-form text generation.
It comprises in-depth human-curated meta-questions spanning various domains, each accompanied by specific proxy-questions with pre-annotated answers.
It assesses the generated content's quality through the evaluator's accuracy in addressing the proxy-questions.
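The accuracy-based scoring idea can be sketched as follows. The substring check here is a toy stand-in for the LLM evaluator the paper relies on, and the example proxy-questions are invented.

```python
# Rough sketch of ProxyQA-style scoring as described in the abstract: score a
# generated long-form text by how many pre-annotated proxy-questions can be
# answered correctly from it. Real ProxyQA uses an LLM evaluator; the substring
# check below is only a toy placeholder.

def proxyqa_score(generated_text: str, proxy_qa: list[tuple[str, str]]) -> float:
    """Fraction of proxy-questions whose annotated answer is supported by the text."""
    text = generated_text.lower()
    hits = sum(answer.lower() in text for _, answer in proxy_qa)
    return hits / len(proxy_qa)

qa = [("What year was the project founded?", "1998"),
      ("Who leads the project?", "Dr. Lee")]
print(proxyqa_score("Founded in 1998, the project is led by Dr. Lee.", qa))  # 1.0
```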
arXiv Detail & Related papers (2024-01-26T18:12:25Z)
- Using Natural Language Explanations to Rescale Human Judgments [81.66697572357477]
We propose a method to rescale ordinal annotations and explanations using large language models (LLMs). We feed annotators' Likert ratings and corresponding explanations into an LLM and prompt it to produce a numeric score anchored in a scoring rubric. Our method rescales the raw judgments without impacting agreement and brings the scores closer to human judgments grounded in the same scoring rubric.
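A minimal sketch of that rescaling loop, assuming a generic prompt-to-text LLM callable: `call_llm` is a placeholder rather than a specific API, and the rubric text is invented for illustration.

```python
# Sketch: prompt an LLM with a Likert rating, the annotator's explanation, and a
# scoring rubric, and ask for a rubric-anchored numeric score. The rubric and the
# call_llm interface are assumptions, not the paper's exact setup.

RUBRIC = """Score 0-100:
0-30   = explanation reveals major factual or coverage problems
31-70  = explanation notes minor issues only
71-100 = explanation indicates the output is accurate and complete"""

def rescale(likert: int, explanation: str, call_llm) -> float:
    prompt = (
        f"Rubric:\n{RUBRIC}\n\n"
        f"Original Likert rating (1-5): {likert}\n"
        f"Annotator explanation: {explanation}\n"
        "Return only a number between 0 and 100 anchored in the rubric."
    )
    return float(call_llm(prompt))

# Usage with any callable that maps a prompt string to a model reply:
# score = rescale(4, "Covers all key points but one date is wrong.", call_llm=my_model)
```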
arXiv Detail & Related papers (2023-05-24T06:19:14Z)
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods like BLEU/ROUGE may not be able to adequately capture the above dimensions.
We propose a new evaluation framework based on LLMs, which compares generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
- Towards a Unified Multi-Dimensional Evaluator for Text Generation [101.47008809623202]
We propose UniEval, a unified multi-dimensional evaluator for Natural Language Generation (NLG).
We re-frame NLG evaluation as a Boolean Question Answering (QA) task, and by guiding the model with different questions, we can use one evaluator to evaluate from multiple dimensions.
Experiments on three typical NLG tasks show that UniEval correlates substantially better with human judgments than existing metrics.
arXiv Detail & Related papers (2022-10-13T17:17:03Z)
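The Boolean-QA framing can be sketched as follows; `p_yes` is a toy stand-in for the trained evaluator's "yes" probability, and the question templates are paraphrased rather than UniEval's exact prompts.

```python
# Sketch of the Boolean-QA framing: each quality dimension becomes a yes/no
# question and a single evaluator scores all of them. p_yes is only a dummy
# word-overlap signal; the real evaluator is a trained model returning P("yes").

QUESTION_TEMPLATES = {
    "coherence": "Is this a coherent summary of the document?",
    "consistency": "Is this claim consistent with the document?",
    "fluency": "Is this a fluent paragraph?",
}

def p_yes(question: str, summary: str, document: str) -> float:
    """Toy stand-in: returns word overlap between summary and document."""
    s, d = set(summary.lower().split()), set(document.lower().split())
    return len(s & d) / max(len(s), 1)

def boolean_qa_scores(summary: str, document: str) -> dict[str, float]:
    """One evaluator, multiple dimensions, each via a different yes/no question."""
    return {dim: p_yes(q, summary, document) for dim, q in QUESTION_TEMPLATES.items()}

print(boolean_qa_scores("The cat sat on the mat.",
                        "A cat was sitting on a mat in the house."))
```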