Chatbot Arena Meets Nuggets: Towards Explanations and Diagnostics in the Evaluation of LLM Responses
- URL: http://arxiv.org/abs/2504.20006v1
- Date: Mon, 28 Apr 2025 17:24:36 GMT
- Title: Chatbot Arena Meets Nuggets: Towards Explanations and Diagnostics in the Evaluation of LLM Responses
- Authors: Sahel Sharifymoghaddam, Shivani Upadhyay, Nandan Thakur, Ronak Pradeep, Jimmy Lin
- Abstract summary: We apply our AutoNuggetizer framework to analyze data from roughly 7K Search Arena battles provided by LMArena. Our results show a significant correlation between nugget scores and human preferences.
- Score: 45.2769075498271
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Battles, or side-by-side comparisons in so-called arenas that elicit human preferences, have emerged as a popular approach to assessing the output quality of LLMs. Recently, this idea has been extended to retrieval-augmented generation (RAG) systems. While undoubtedly representing an advance in evaluation, battles have at least two drawbacks, particularly in the context of complex information-seeking queries: they are neither explanatory nor diagnostic. Recently, the nugget evaluation methodology has emerged as a promising approach to evaluate the quality of RAG answers. Nuggets decompose long-form LLM-generated answers into atomic facts, highlighting important pieces of information necessary in a "good" response. In this work, we apply our AutoNuggetizer framework to analyze data from roughly 7K Search Arena battles provided by LMArena in a fully automatic manner. Our results show a significant correlation between nugget scores and human preferences, showcasing promise in our approach to explainable and diagnostic system evaluations.
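To make the methodology concrete, the sketch below shows one way a nugget-based score could be computed per answer and then compared against human battle votes. The label scheme, partial-credit weights, and data layout are illustrative assumptions rather than the authors' exact AutoNuggetizer implementation, and agreement rate is used here as a simple proxy for the correlation analysis reported in the paper.

```python
# Minimal sketch of nugget-based scoring for side-by-side (battle) evaluation.
# The support labels, partial-credit weights, and record layout below are
# illustrative assumptions, not the exact AutoNuggetizer configuration.

from typing import Dict, List

# For a given answer, each nugget carries an importance label ("vital" or
# "okay") and an assigned support label.
SUPPORT_CREDIT = {"support": 1.0, "partial_support": 0.5, "not_support": 0.0}


def nugget_score(assignments: List[Dict[str, str]], vital_only: bool = False) -> float:
    """Average support credit over the nugget assignments for one answer."""
    pool = [a for a in assignments if not vital_only or a["importance"] == "vital"]
    if not pool:
        return 0.0
    return sum(SUPPORT_CREDIT[a["support"]] for a in pool) / len(pool)


def battle_agreement(battles: List[dict]) -> float:
    """Fraction of decided battles where the higher nugget score matches the human vote."""
    decided = [b for b in battles if b["winner"] in ("a", "b")]
    hits = 0
    for b in decided:
        score_a = nugget_score(b["assignments_a"])
        score_b = nugget_score(b["assignments_b"])
        predicted = "a" if score_a > score_b else "b" if score_b > score_a else None
        hits += predicted == b["winner"]
    return hits / len(decided) if decided else 0.0


# Hypothetical battle record; real input would come from nuggetized Search Arena battles.
battles = [{
    "winner": "a",
    "assignments_a": [{"importance": "vital", "support": "support"},
                      {"importance": "okay", "support": "partial_support"}],
    "assignments_b": [{"importance": "vital", "support": "not_support"},
                      {"importance": "okay", "support": "support"}],
}]
print(f"Agreement with human votes: {battle_agreement(battles):.2f}")
```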
Related papers
- The Great Nugget Recall: Automating Fact Extraction and RAG Evaluation with Large Language Models [53.12387628636912]
We propose an automatic evaluation framework that is validated against human annotations.
The underlying nugget methodology was originally developed for the TREC Question Answering (QA) Track in 2003.
We observe strong agreement at the run level between scores derived from fully automatic nugget evaluation and human-based variants.
arXiv Detail & Related papers (2025-04-21T12:55:06Z)
- On the Influence of Context Size and Model Choice in Retrieval-Augmented Generation Systems [5.69361786082969]
Retrieval-augmented generation (RAG) has emerged as an approach to augment large language models (LLMs).
We evaluate various context sizes, BM25 and semantic search as retrievers, and eight base LLMs.
Our findings indicate that final QA performance improves steadily with up to 15 snippets but stagnates or declines beyond that.
arXiv Detail & Related papers (2025-02-20T17:34:34Z)
- Addressing Hallucinations with RAG and NMISS in Italian Healthcare LLM Chatbots [0.0]
I combine detection and mitigation techniques to address hallucinations in Large Language Models (LLMs).
Mitigation is achieved in a question-answering Retrieval-Augmented Generation (RAG) framework, while detection is obtained by introducing the Negative Missing Information Scoring System (NMISS).
This combined approach offers new insights into the reduction and more accurate assessment of hallucinations in LLMs, with applications in real-world healthcare tasks and other domains.
arXiv Detail & Related papers (2024-12-05T15:11:12Z)
- Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering [70.44269982045415]
Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs).
We introduce the Medical Retrieval-Augmented Generation Benchmark (MedRGB), which provides various supplementary elements to four medical QA datasets.
Our experimental results reveal current models' limited ability to handle noise and misinformation in the retrieved documents.
arXiv Detail & Related papers (2024-11-14T06:19:18Z)
- Do RAG Systems Cover What Matters? Evaluating and Optimizing Responses with Sub-Question Coverage [74.70255719194819]
We introduce a novel framework based on sub-question coverage, which measures how well a RAG system addresses different facets of a question.
We use this framework to evaluate three commercial generative answer engines: You.com, Perplexity AI, and Bing Chat.
We find that while all answer engines cover core sub-questions more often than background or follow-up ones, they still miss around 50% of core sub-questions.
arXiv Detail & Related papers (2024-10-20T22:59:34Z)
- Ranking Generated Answers: On the Agreement of Retrieval Models with Humans on Consumer Health Questions [25.158868133182025]
We present a method for evaluating the output of generative large language models (LLMs).
We use ranking models trained on annotated document collections as a substitute for explicit relevance judgments.
In a user study, our method correlates with the preferences of a human expert.
arXiv Detail & Related papers (2024-08-19T09:27:45Z)
- RAG-QA Arena: Evaluating Domain Robustness for Long-form Retrieval Augmented Question Answering [61.19126689470398]
Long-form RobustQA (LFRQA) is a new dataset covering 26K queries and large corpora across seven different domains.
We show via experiments that RAG-QA Arena and human judgments on answer quality are highly correlated.
Only 41.3% of the most competitive LLM's answers are preferred to LFRQA's answers, demonstrating RAG-QA Arena as a challenging evaluation platform for future research.
arXiv Detail & Related papers (2024-07-19T03:02:51Z)
- Evaluating the Retrieval Component in LLM-Based Question Answering Systems [1.7013938542585922]
This study proposes a baseline for evaluating retrievers in Retrieval-Augmented Generation (RAG)-based chatbots.
Our findings demonstrate that this evaluation framework provides a clearer picture of how the retriever performs.
Our method accounts for LLMs' ability to ignore irrelevant contexts, as well as for potential errors and hallucinations in their responses.
arXiv Detail & Related papers (2024-06-10T16:46:22Z)
- PICK: Polished & Informed Candidate Scoring for Knowledge-Grounded Dialogue Systems [59.1250765143521]
Current knowledge-grounded dialogue systems often fail to align the generated responses with human-preferred qualities.
We propose Polished & Informed Candidate Scoring (PICK), a generation re-scoring framework.
We demonstrate the effectiveness of PICK in generating responses that are more faithful while keeping them relevant to the dialogue history.
arXiv Detail & Related papers (2023-09-19T08:27:09Z)