RAG-ConfusionQA: A Benchmark for Evaluating LLMs on Confusing Questions
- URL: http://arxiv.org/abs/2410.14567v1
- Date: Fri, 18 Oct 2024 16:11:29 GMT
- Title: RAG-ConfusionQA: A Benchmark for Evaluating LLMs on Confusing Questions
- Authors: Zhiyuan Peng, Jinming Nian, Alexandre Evfimievski, Yi Fang
- Abstract summary: Conversational AI agents use Retrieval Augmented Generation (RAG) to provide verifiable document-grounded responses to user inquiries.
This paper presents a novel synthetic data generation method to efficiently create a diverse set of context-grounded confusing questions from a given document corpus.
- Score: 52.33835101586687
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conversational AI agents use Retrieval Augmented Generation (RAG) to provide verifiable document-grounded responses to user inquiries. However, many natural questions do not have good answers: about 25% contain false assumptions [Yu2023:CREPE], and over 50% are ambiguous [Min2020:AmbigQA]. RAG agents need high-quality data to improve their responses to confusing questions. This paper presents a novel synthetic data generation method to efficiently create a diverse set of context-grounded confusing questions from a given document corpus. We conduct an empirical comparative evaluation of several large language models as RAG agents to measure the accuracy of confusion detection and appropriate response generation. We contribute a benchmark dataset to the public domain.
Related papers
- LINKAGE: Listwise Ranking among Varied-Quality References for Non-Factoid QA Evaluation via LLMs [61.57691505683534]
Non-Factoid (NF) Question Answering (QA) is challenging to evaluate due to diverse potential answers and no objective criterion.
Large Language Models (LLMs) have been resorted to for NFQA evaluation due to their compelling performance on various NLP tasks.
We propose a novel listwise NFQA evaluation approach that utilizes LLMs to rank candidate answers in a list of reference answers sorted by descending quality.
arXiv Detail & Related papers (2024-09-23T06:42:21Z)
- RAG-QA Arena: Evaluating Domain Robustness for Long-form Retrieval Augmented Question Answering [61.19126689470398]
Long-form RobustQA (LFRQA) is a new dataset covering 26K queries and large corpora across seven different domains.
We show via experiments that RAG-QA Arena and human judgments on answer quality are highly correlated.
Only 41.3% of the most competitive LLM's answers are preferred to LFRQA's answers, demonstrating RAG-QA Arena as a challenging evaluation platform for future research.
arXiv Detail & Related papers (2024-07-19T03:02:51Z)
- GenSco: Can Question Decomposition based Passage Alignment improve Question Answering? [1.5776201492893507]
"GenSco" is a novel approach of selecting passages based on the predicted decomposition of the multi-hop questions.
We evaluate on three broadly established multi-hop question answering datasets.
arXiv Detail & Related papers (2024-07-14T15:25:08Z)
- Hallucination Detection: Robustly Discerning Reliable Answers in Large Language Models [70.19081534515371]
Large Language Models (LLMs) have gained widespread adoption in various natural language processing tasks.
They generate unfaithful or inconsistent content that deviates from the input source, leading to severe consequences.
We propose a robust discriminator named RelD to effectively detect hallucination in LLMs' generated answers.
arXiv Detail & Related papers (2024-07-04T18:47:42Z)
- Optimization of Retrieval-Augmented Generation Context with Outlier Detection [0.0]
We focus on methods to reduce the size and improve the quality of the prompt context required for question-answering systems.
Our goal is to select the most semantically relevant documents, treating the discarded ones as outliers.
It was found that the greatest improvements were achieved with increasing complexity of the questions and answers.
arXiv Detail & Related papers (2024-07-01T15:53:29Z)
- Multi-LLM QA with Embodied Exploration [55.581423861790945]
We investigate the use of Multi-Embodied LLM Explorers (MELE) for question-answering in an unknown environment.
Multiple LLM-based agents independently explore and then answer queries about a household environment.
We analyze different aggregation methods to generate a single, final answer for each query.
arXiv Detail & Related papers (2024-06-16T12:46:40Z)
- MediQ: Question-Asking LLMs and a Benchmark for Reliable Interactive Clinical Reasoning [36.400896909161006]
We develop systems that proactively ask questions to gather more information and respond reliably.
We introduce a benchmark - MediQ - to evaluate question-asking ability in LLMs.
arXiv Detail & Related papers (2024-06-03T01:32:52Z)
- Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation [65.16137964758612]
We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books.
Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text.
arXiv Detail & Related papers (2024-05-31T20:15:10Z)
- Crafting Interpretable Embeddings by Asking LLMs Questions [89.49960984640363]
Large language models (LLMs) have rapidly improved text embeddings for a growing array of natural-language processing tasks.
We introduce question-answering embeddings (QA-Emb), embeddings where each feature represents an answer to a yes/no question asked to an LLM.
We use QA-Emb to flexibly generate interpretable models for predicting fMRI voxel responses to language stimuli.
arXiv Detail & Related papers (2024-05-26T22:30:29Z)
- CONFLARE: CONFormal LArge language model REtrieval [0.0]
Retrieval-augmented generation (RAG) frameworks enable large language models (LLMs) to retrieve relevant information from a knowledge base and incorporate it into the context for generating responses.
RAG does not guarantee valid responses if retrieval fails to identify the necessary information as the context for response generation.
We introduce a four-step framework for applying conformal prediction to quantify retrieval uncertainty in RAG frameworks.
arXiv Detail & Related papers (2024-04-04T02:58:21Z)
- Chain-of-Discussion: A Multi-Model Framework for Complex Evidence-Based Question Answering [55.295699268654545]
We propose a novel Chain-of-Discussion framework to leverage the synergy among open-source Large Language Models.
Our experiments show that discussions among multiple LLMs play a vital role in enhancing the quality of answers.
arXiv Detail & Related papers (2024-02-26T05:31:34Z)
- Narrowing the Knowledge Evaluation Gap: Open-Domain Question Answering with Multi-Granularity Answers [21.814007454504978]
We present a novel evaluation setting where a predicted answer is evaluated in terms of accuracy and informativeness against a set of multi-granularity answers.
Our experiments show that large language models with standard decoding tend to generate specific answers, which are often incorrect.
When evaluated on multi-granularity answers, DRAG yields a nearly 20 point increase in accuracy on average, which further increases for rare entities.
arXiv Detail & Related papers (2024-01-09T17:44:36Z)
- Knowing What LLMs DO NOT Know: A Simple Yet Effective Self-Detection Method [36.24876571343749]
Large Language Models (LLMs) have shown great potential in Natural Language Processing (NLP) tasks.
Recent literature reveals that LLMs generate nonfactual responses intermittently.
We propose a novel self-detection method to detect which questions an LLM does not know and is therefore prone to answer with nonfactual results.
arXiv Detail & Related papers (2023-10-27T06:22:14Z)
- SQUARE: Automatic Question Answering Evaluation using Multiple Positive and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation).
We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
arXiv Detail & Related papers (2023-09-21T16:51:30Z)
- An Empirical Comparison of LM-based Question and Answer Generation Methods [79.31199020420827]
Question and answer generation (QAG) consists of generating a set of question-answer pairs given a context.
In this paper, we establish baselines with three different QAG methodologies that leverage sequence-to-sequence language model (LM) fine-tuning.
Experiments show that an end-to-end QAG model, which is computationally light at both training and inference times, is generally robust and outperforms other more convoluted approaches.
arXiv Detail & Related papers (2023-05-26T14:59:53Z)
- Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback [127.75419038610455]
Large language models (LLMs) are able to generate human-like, fluent responses for many downstream tasks.
This paper proposes an LLM-Augmenter system, which augments a black-box LLM with a set of plug-and-play modules.
arXiv Detail & Related papers (2023-02-24T18:48:43Z)
- RQUGE: Reference-Free Metric for Evaluating Question Generation by Answering the Question [29.18544401904503]
We propose a new metric, RQUGE, based on the answerability of the candidate question given the context.
We demonstrate that RQUGE has a higher correlation with human judgment without relying on the reference question.
arXiv Detail & Related papers (2022-11-02T21:10:09Z)
- Guided Transformer: Leveraging Multiple External Sources for Representation Learning in Conversational Search [36.64582291809485]
Asking clarifying questions in response to ambiguous or faceted queries has been recognized as a useful technique for various information retrieval systems.
In this paper, we enrich the representations learned by Transformer networks using a novel attention mechanism from external information sources.
Our experiments use a public dataset for search clarification and demonstrate significant improvements compared to competitive baselines.
arXiv Detail & Related papers (2020-06-13T03:24:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.