Related papers: Narrowing the Knowledge Evaluation Gap: Open-Domain Question Answering with Multi-Granularity Answers

Narrowing the Knowledge Evaluation Gap: Open-Domain Question Answering with Multi-Granularity Answers

URL: http://arxiv.org/abs/2401.04695v2
Date: Thu, 1 Aug 2024 06:23:58 GMT
Title: Narrowing the Knowledge Evaluation Gap: Open-Domain Question Answering with Multi-Granularity Answers
Authors: Gal Yona, Roee Aharoni, Mor Geva,
Abstract summary: We present a novel evaluation setting where a predicted answer is evaluated in terms of accuracy and informativeness against a set of multi-granularity answers. Our experiments show that large language models with standard decoding tend to generate specific answers, which are often incorrect. When evaluated on multi-granularity answers, DRAG yields a nearly 20 point increase in accuracy on average, which further increases for rare entities.
Score: 21.814007454504978
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Factual questions typically can be answered correctly at different levels of granularity. For example, both ``August 4, 1961'' and ``1961'' are correct answers to the question ``When was Barack Obama born?''. Standard question answering (QA) evaluation protocols, however, do not explicitly take this into account and compare a predicted answer against answers of a single granularity level. In this work, we propose GRANOLA QA, a novel evaluation setting where a predicted answer is evaluated in terms of accuracy and informativeness against a set of multi-granularity answers. We present a simple methodology for enriching existing datasets with multi-granularity answers, and create GRANOLA-EQ, a multi-granularity version of the EntityQuestions dataset. We evaluate a range of decoding methods on GRANOLA-EQ, including a new algorithm, called Decoding with Response Aggregation (DRAG), that is geared towards aligning the response granularity with the model's uncertainty. Our experiments show that large language models with standard decoding tend to generate specific answers, which are often incorrect. In contrast, when evaluated on multi-granularity answers, DRAG yields a nearly 20 point increase in accuracy on average, which further increases for rare entities. Overall, this reveals that standard evaluation and decoding schemes may significantly underestimate the knowledge encapsulated in LMs.

Related papers

ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering [54.72902502486611]
ReAG is a Reasoning-Augmented Multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model that filters irrelevant passages.<n>ReAG significantly outperforms prior methods, improving answer accuracy and providing interpretable reasoning grounded in retrieved evidence.
arXiv Detail & Related papers (2025-11-27T19:01:02Z)
KERAG: Knowledge-Enhanced Retrieval-Augmented Generation for Advanced Question Answering [26.051374461832964]
Retrieval-Augmented Generation (RAG) mitigates hallucination in Large Language Models (LLMs)<n>We present KERAG, a novel KG-based RAG pipeline that enhances QA coverage by retrieving a broader subgraph likely to contain relevant information.<n> Experiments demonstrate that KERAG surpasses state-of-the-art solutions by about 7% in quality and exceeds GPT-4o (Tool) by 10-21%.
arXiv Detail & Related papers (2025-09-05T00:06:00Z)
MinosEval: Distinguishing Factoid and Non-Factoid for Tailored Open-Ended QA Evaluation with LLMs [15.278241998033822]
Open-ended question answering (QA) is a key task for evaluating the capabilities of large language models (LLMs)<n>We propose textbfMinosEval, a novel evaluation method that first distinguishes open-ended questions and then ranks candidate answers.
arXiv Detail & Related papers (2025-06-18T07:49:13Z)
Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering [78.89231943329885]
One of the most widely used tasks to evaluate Large Language Models (LLMs) is Multiple-Choice Question Answering (MCQA) In this work, we shed light on the inconsistencies of MCQA evaluation strategies, which can lead to inaccurate and misleading model comparisons.
arXiv Detail & Related papers (2025-03-19T08:45:03Z)
RAG-ConfusionQA: A Benchmark for Evaluating LLMs on Confusing Questions [52.33835101586687]
Conversational AI agents use Retrieval Augmented Generation (RAG) to provide verifiable document-grounded responses to user inquiries. This paper presents a novel synthetic data generation method to efficiently create a diverse set of context-grounded confusing questions from a given document corpus.
arXiv Detail & Related papers (2024-10-18T16:11:29Z)
RAG-QA Arena: Evaluating Domain Robustness for Long-form Retrieval Augmented Question Answering [61.19126689470398]
Long-form RobustQA (LFRQA) is a new dataset covering 26K queries and large corpora across seven different domains. We show via experiments that RAG-QA Arena and human judgments on answer quality are highly correlated. Only 41.3% of the most competitive LLM's answers are preferred to LFRQA's answers, demonstrating RAG-QA Arena as a challenging evaluation platform for future research.
arXiv Detail & Related papers (2024-07-19T03:02:51Z)
Multi-LLM QA with Embodied Exploration [55.581423861790945]
We investigate the use of Multi-Embodied LLM Explorers (MELE) for question-answering in an unknown environment. Multiple LLM-based agents independently explore and then answer queries about a household environment. We analyze different aggregation methods to generate a single, final answer for each query.
arXiv Detail & Related papers (2024-06-16T12:46:40Z)
Accurate and Nuanced Open-QA Evaluation Through Textual Entailment [4.762213968673381]
We propose to study the entailment relations of answers to identify more informative and more general system answers. The entailment-based evaluation we propose allows the assignment of bonus or partial marks by quantifying the inference gap between answers.
arXiv Detail & Related papers (2024-05-26T21:33:27Z)
SQUARE: Automatic Question Answering Evaluation using Multiple Positive and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation) We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
arXiv Detail & Related papers (2023-09-21T16:51:30Z)
Answering Ambiguous Questions via Iterative Prompting [84.3426020642704]
In open-domain question answering, due to the ambiguity of questions, multiple plausible answers may exist. One approach is to directly predict all valid answers, but this can struggle with balancing relevance and diversity. We present AmbigPrompt to address the imperfections of existing approaches to answering ambiguous questions.
arXiv Detail & Related papers (2023-07-08T04:32:17Z)
Mastering the ABCDs of Complex Questions: Answer-Based Claim Decomposition for Fine-grained Self-Evaluation [9.776667356119352]
We propose answer-based claim decomposition (ABCD), a prompting strategy that decomposes questions into true/false claims. Using the decomposed ABCD claims, we perform fine-grained self-evaluation. We find that GPT-3.5 has some ability to determine to what extent its answer satisfies the criteria of the input question.
arXiv Detail & Related papers (2023-05-24T05:53:11Z)
Answering Ambiguous Questions through Generative Evidence Fusion and Round-Trip Prediction [46.38201136570501]
We present a model that aggregates and combines evidence from multiple passages to adaptively predict a single answer or a set of question-answer pairs for ambiguous questions. Our model, named Refuel, achieves a new state-of-the-art performance on the AmbigQA dataset, and shows competitive performance on NQ-Open and TriviaQA.
arXiv Detail & Related papers (2020-11-26T05:48:55Z)
ClarQ: A large-scale and diverse dataset for Clarification Question Generation [67.1162903046619]
We devise a novel bootstrapping framework that assists in the creation of a diverse, large-scale dataset of clarification questions based on postcomments extracted from stackexchange. We quantitatively demonstrate the utility of the newly created dataset by applying it to the downstream task of question-answering. We release this dataset in order to foster research into the field of clarification question generation with the larger goal of enhancing dialog and question answering systems.
arXiv Detail & Related papers (2020-06-10T17:56:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.