Conversational Gold: Evaluating Personalized Conversational Search System using Gold Nuggets
- URL: http://arxiv.org/abs/2503.09902v1
- Date: Wed, 12 Mar 2025 23:44:10 GMT
- Title: Conversational Gold: Evaluating Personalized Conversational Search System using Gold Nuggets
- Authors: Zahra Abbasiantaeb, Simon Lupart, Leif Azzopardi, Jeffery Dalton, Mohammad Aliannejadi
- Abstract summary: We introduce a new resource for assessing the retrieval effectiveness and relevance of responses generated by RAG systems. Our dataset extends TREC iKAT 2023 to the TREC iKAT 2024 collection, which includes 17 conversations and 20,575 passage relevance assessments.
- Score: 8.734527090842139
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rise of personalized conversational search systems has been driven by advancements in Large Language Models (LLMs), enabling these systems to retrieve and generate answers for complex information needs. However, the automatic evaluation of responses generated by Retrieval Augmented Generation (RAG) systems remains an understudied challenge. In this paper, we introduce a new resource for assessing the retrieval effectiveness and relevance of responses generated by RAG systems, using a nugget-based evaluation framework. Built upon the foundation of TREC iKAT 2023, our dataset extends to the TREC iKAT 2024 collection, which includes 17 conversations and 20,575 passage relevance assessments, together with 2,279 extracted gold nuggets and 62 manually written gold answers from NIST assessors. While maintaining the core structure of its predecessor, this new collection enables a deeper exploration of generation tasks in conversational settings. Key improvements in iKAT 2024 include: (1) "gold nuggets" -- concise, essential pieces of information extracted from relevant passages of the collection -- which serve as a foundation for automatic response evaluation; (2) manually written answers to provide a gold standard for response evaluation; (3) unanswerable questions to evaluate model hallucination; (4) expanded user personas, providing richer contextual grounding; and (5) a transition from Personal Text Knowledge Base (PTKB) ranking to PTKB classification and selection. Built on this resource, we provide a framework for evaluating long-form answer generation, involving nugget extraction and nugget matching, linked to retrieval. This establishes a solid resource for advancing research in personalized conversational search and long-form answer generation. Our resources are publicly available at https://github.com/irlabamsterdam/CONE-RAG.
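To make the evaluation loop concrete, here is a minimal Python sketch of nugget-based response scoring. It treats gold nuggets as short strings and uses token overlap as a crude stand-in for the LLM-based nugget matching the paper describes; the function names and the 0.6 threshold are illustrative assumptions, not the CONE-RAG implementation.

```python
# Minimal sketch of nugget-based response scoring. Token overlap stands in
# for the LLM-based nugget matching described in the paper; names and the
# threshold are illustrative, not the CONE-RAG implementation.

def _tokens(text: str) -> set[str]:
    """Lowercase tokens with surrounding punctuation stripped."""
    return {t.strip(".,;:!?").lower() for t in text.split()}

def nugget_matched(nugget: str, response: str, threshold: float = 0.6) -> bool:
    """A nugget counts as covered if most of its tokens occur in the response."""
    nug, resp = _tokens(nugget), _tokens(response)
    return len(nug & resp) / max(len(nug), 1) >= threshold

def nugget_recall(gold_nuggets: list[str], response: str) -> float:
    """Fraction of gold nuggets that a generated response covers."""
    if not gold_nuggets:
        return 0.0
    return sum(nugget_matched(n, response) for n in gold_nuggets) / len(gold_nuggets)

# Toy example: the response covers two of the three gold nuggets.
gold = ["trams run every 10 minutes",
        "a day ticket costs 9 euros",
        "the museum closes at 17:00"]
answer = "Trams run every 10 minutes, and a day ticket costs 9 euros."
print(nugget_recall(gold, answer))  # ~0.67
```

In the actual framework the matching step would be an LLM judge deciding whether each nugget is supported by the response; the recall-style aggregation over gold nuggets stays the same.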
Related papers
- The Great Nugget Recall: Automating Fact Extraction and RAG Evaluation with Large Language Models [53.12387628636912]
We propose an automatic evaluation framework that is validated against human annotations.
This approach was originally developed for the TREC Question Answering (QA) Track in 2003.
We observe strong agreement at the run level between scores derived from fully automatic nugget evaluation and human-based variants.
arXiv Detail & Related papers (2025-04-21T12:55:06Z)
- Retrieval-Augmented Visual Question Answering via Built-in Autoregressive Search Engines [17.803396998387665]
Retrieval-augmented generation (RAG) has emerged to address the knowledge-intensive visual question answering (VQA) task.
We propose ReAuSE, an alternative to previous RAG models for the knowledge-based VQA task.
Our model functions both as a generative retriever and an accurate answer generator.
arXiv Detail & Related papers (2025-02-23T16:39:39Z)
- From Documents to Dialogue: Building KG-RAG Enhanced AI Assistants [28.149173430599525]
We use a Retrieval-Augmented Generation (RAG) framework powered by a Knowledge Graph (KG) to retrieve relevant information from external knowledge sources.
Our KG-RAG system retrieves relevant provenances, which are added to the user prompt's context before it is sent to the LLM that generates the response (a minimal sketch of this pattern follows this entry).
Our evaluation demonstrates that this approach significantly enhances response relevance, reducing irrelevant answers by over 50% and increasing fully relevant answers by 88% compared to the existing production system.
arXiv Detail & Related papers (2025-02-21T06:22:12Z)
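A minimal sketch of that retrieve-provenances-then-prompt pattern, assuming a toy dictionary-backed KG and word-overlap retrieval; the store, helper names, and prompt template are hypothetical stand-ins, not the paper's components:

```python
# Minimal sketch of the KG-RAG pattern from the entry above: retrieve
# provenances, add them to the prompt context, then hand off to an LLM.
# The dict-backed "KG" and word-overlap ranking are toy stand-ins.

def retrieve_provenances(kg: dict[str, list[str]], question: str, k: int = 3) -> list[str]:
    """Toy retrieval: rank stored facts by word overlap with the question."""
    q = set(question.lower().split())
    facts = [fact for facts in kg.values() for fact in facts]
    return sorted(facts, key=lambda f: -len(q & set(f.lower().split())))[:k]

def build_prompt(question: str, provenances: list[str]) -> str:
    """Prepend retrieved provenances to the user question."""
    context = "\n".join(f"- {p}" for p in provenances)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer using only the context."

kg = {"printer": ["The X200 printer supports duplex printing.",
                  "The X200 firmware is updated over USB."]}
question = "Does the X200 support duplex printing?"
print(build_prompt(question, retrieve_provenances(kg, question)))
# The resulting prompt would then be sent to the LLM that generates the response.
```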
- Initial Nugget Evaluation Results for the TREC 2024 RAG Track with the AutoNuggetizer Framework [53.12387628636912]
This report provides an initial look at partial results from the TREC 2024 Retrieval-Augmented Generation (RAG) Track.
We have identified RAG evaluation as a barrier to continued progress in information access.
arXiv Detail & Related papers (2024-11-14T17:25:43Z)
- Do RAG Systems Cover What Matters? Evaluating and Optimizing Responses with Sub-Question Coverage [74.70255719194819]
We introduce a novel framework based on sub-question coverage, which measures how well a RAG system addresses different facets of a question.
We use this framework to evaluate three commercial generative answer engines: You.com, Perplexity AI, and Bing Chat.
We find that while all answer engines cover core sub-questions more often than background or follow-up ones, they still miss around 50% of core sub-questions.
arXiv Detail & Related papers (2024-10-20T22:59:34Z)
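A minimal sketch of the sub-question coverage idea from the entry above. In the paper, both the question decomposition and the covered-or-not judgment come from models; here the sub-questions are given and a keyword check stands in for the judge (all names hypothetical):

```python
# Minimal sketch of sub-question coverage. A keyword check stands in for
# the model-based "does the response address this sub-question?" judge.

def addresses(response: str, keywords: set[str]) -> bool:
    """Stub judge: a sub-question counts as covered if all its keywords appear."""
    return keywords <= set(response.lower().split())

def sub_question_coverage(response: str, sub_questions: dict[str, set[str]]) -> float:
    """Fraction of sub-questions the response addresses."""
    if not sub_questions:
        return 0.0
    return sum(addresses(response, kw) for kw in sub_questions.values()) / len(sub_questions)

subs = {"What causes it?": {"caused", "by"},
        "How is it treated?": {"treated", "with"}}
resp = "migraines are often caused by stress and treated with triptans"
print(sub_question_coverage(resp, subs))  # 1.0
```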
- TREC iKAT 2023: A Test Collection for Evaluating Conversational and Interactive Knowledge Assistants [10.511277428023305]
The extended TREC Interactive Knowledge Assistance Track (iKAT) collection aims to enable researchers to test and evaluate Conversational Search Agents (CSA).
The collection contains a set of 36 personalized dialogues over 20 different topics, each coupled with a Personal Text Knowledge Base (PTKB) that defines the bespoke user personas.
Relevance assessments are provided for a total of 344 turns covering approximately 26,000 passages, along with additional assessments of generated responses over four key dimensions: relevance, completeness, groundedness, and naturalness.
arXiv Detail & Related papers (2024-05-04T11:22:16Z)
- Evaluating Generative Ad Hoc Information Retrieval [58.800799175084286]
Generative retrieval systems often directly return a grounded generated text as a response to a query.
Quantifying the utility of the textual responses is essential for appropriately evaluating such generative ad hoc retrieval.
arXiv Detail & Related papers (2023-11-08T14:05:00Z)
- FCC: Fusing Conversation History and Candidate Provenance for Contextual Response Ranking in Dialogue Systems [53.89014188309486]
We present a flexible neural framework that can integrate contextual information from multiple channels.
We evaluate our model on the MSDialog dataset widely used for evaluating conversational response ranking tasks.
arXiv Detail & Related papers (2023-03-31T23:58:28Z)
- Reranking Overgenerated Responses for End-to-End Task-Oriented Dialogue Systems [71.33737787564966]
End-to-end (E2E) task-oriented dialogue (ToD) systems are prone to falling into the so-called 'likelihood trap'.
We propose a reranking method which aims to select high-quality items from the lists of responses initially overgenerated by the system.
Our methods improve a state-of-the-art E2E ToD system by 2.4 BLEU, 3.2 ROUGE, and 2.8 METEOR scores, achieving new peak results.
arXiv Detail & Related papers (2022-11-07T15:59:49Z)
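A minimal sketch of the overgenerate-and-rerank idea from the entry above: sample several candidates, then pick the best one under a quality scorer rather than the most likely one. The paper trains a reranker model; the scorer here is a deliberately naive stub:

```python
# Minimal sketch of overgenerate-and-rerank. A trained reranker would score
# candidates; here a naive "distinct word count" scorer stands in.

def rerank(candidates: list[str], score) -> str:
    """Pick the highest-scoring candidate instead of the most likely one."""
    return max(candidates, key=score)

candidates = ["ok",
              "Your table for two at 7pm is confirmed.",
              "yes yes yes"]
score = lambda response: len(set(response.lower().split()))
print(rerank(candidates, score))  # the informative confirmation wins
```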
- Open-Retrieval Conversational Question Answering [62.11228261293487]
We introduce an open-retrieval conversational question answering (ORConvQA) setting, where we learn to retrieve evidence from a large collection before extracting answers.
We build an end-to-end system for ORConvQA, featuring a retriever, a reranker, and a reader that are all based on Transformers.
arXiv Detail & Related papers (2020-05-22T19:39:50Z)
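A minimal sketch of the retriever-reranker-reader pipeline from the last entry, with word-overlap ranking and stubbed stages standing in for the paper's Transformer-based components (all names illustrative):

```python
# Minimal sketch of an ORConvQA-style pipeline: retrieve from a collection,
# rerank, then read off an answer. All three stages are stubs standing in
# for the paper's Transformer-based retriever, reranker, and reader.

def retrieve(collection: list[str], query: str, k: int = 5) -> list[str]:
    """Toy first-stage retrieval by word overlap with the query."""
    q = set(query.lower().split())
    return sorted(collection, key=lambda p: -len(q & set(p.lower().split())))[:k]

def rerank(passages: list[str], query: str, k: int = 2) -> list[str]:
    """Stub reranker; a real one scores (query, passage) pairs jointly."""
    return passages[:k]

def read(passages: list[str], query: str) -> str:
    """Stub reader: return the top passage as the answer."""
    return passages[0] if passages else ""

collection = ["The Eiffel Tower is 330 metres tall.",
              "Paris is the capital of France."]
query = "How tall is the Eiffel Tower?"
print(read(rerank(retrieve(collection, query), query), query))
```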