NeoQA: Evidence-based Question Answering with Generated News Events
- URL: http://arxiv.org/abs/2505.05949v1
- Date: Fri, 09 May 2025 10:51:29 GMT
- Title: NeoQA: Evidence-based Question Answering with Generated News Events
- Authors: Max Glockner, Xiang Jiang, Leonardo F. R. Ribeiro, Iryna Gurevych, Markus Dreyer
- Abstract summary: We introduce NeoQA (News Events for Out-of-training Question Answering), a benchmark built from fictional news events and entities so that no prior evidence exists in LLM training data. We propose our dataset as a new platform for evaluating evidence-based question answering.
- Score: 53.85274258429368
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluating Retrieval-Augmented Generation (RAG) in large language models (LLMs) is challenging because benchmarks can quickly become stale. Questions initially requiring retrieval may become answerable from pretraining knowledge as newer models incorporate more recent information during pretraining, making it difficult to distinguish evidence-based reasoning from recall. We introduce NeoQA (News Events for Out-of-training Question Answering), a benchmark designed to address this issue. To construct NeoQA, we generated timelines and knowledge bases of fictional news events and entities along with news articles and Q&A pairs to prevent LLMs from leveraging pretraining knowledge, ensuring that no prior evidence exists in their training data. We propose our dataset as a new platform for evaluating evidence-based question answering, as it requires LLMs to generate responses exclusively from retrieved evidence and only when sufficient evidence is available. NeoQA enables controlled evaluation across various evidence scenarios, including cases with missing or misleading details. Our findings indicate that LLMs struggle to distinguish subtle mismatches between questions and evidence, and suffer from shortcut reasoning when key information required to answer a question is missing from the evidence, underscoring key limitations in evidence-based reasoning.
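To make the evaluation setting concrete, here is a minimal sketch of the kind of scenario-controlled harness the abstract describes. The Example fields, the REFUSAL convention, and the answer_with_llm placeholder are assumptions for illustration, not the actual NeoQA schema or evaluation code.

```python
# A minimal sketch of an evidence-based QA evaluation loop in the spirit of
# NeoQA. The Example fields, the REFUSAL convention, and answer_with_llm are
# hypothetical illustrations, not the actual NeoQA schema or code.

from dataclasses import dataclass

REFUSAL = "insufficient evidence"

@dataclass
class Example:
    question: str
    evidence_docs: list[str]  # fictional news articles provided as evidence
    gold_answer: str          # set to REFUSAL when the evidence is insufficient
    scenario: str             # e.g. "full", "missing-detail", "misleading"

def answer_with_llm(question: str, evidence_docs: list[str]) -> str:
    """Placeholder for an LLM call that must answer strictly from the given
    evidence and refuse otherwise. This trivial baseline always refuses."""
    return REFUSAL

def accuracy_by_scenario(examples: list[Example]) -> dict[str, float]:
    """Credit a prediction when it matches the gold answer: answering from
    sufficient evidence, or refusing when key evidence is missing."""
    correct: dict[str, int] = {}
    total: dict[str, int] = {}
    for ex in examples:
        pred = answer_with_llm(ex.question, ex.evidence_docs)
        total[ex.scenario] = total.get(ex.scenario, 0) + 1
        if pred.strip().lower() == ex.gold_answer.strip().lower():
            correct[ex.scenario] = correct.get(ex.scenario, 0) + 1
    return {s: correct.get(s, 0) / total[s] for s in total}
```

Breaking accuracy out per evidence scenario is what would surface the failure modes reported above, such as shortcut reasoning when a key detail is absent from the evidence.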
Related papers
- ELOQ: Resources for Enhancing LLM Detection of Out-of-Scope Questions [52.33835101586687]
We study out-of-scope questions, where the retrieved document appears semantically similar to the question but lacks the necessary information to answer it.
We propose ELOQ, a guided hallucination-based approach that automatically generates a diverse set of out-of-scope questions from post-cutoff documents.
arXiv Detail & Related papers (2024-10-18T16:11:29Z)
- Utilize the Flow before Stepping into the Same River Twice: Certainty Represented Knowledge Flow for Refusal-Aware Instruction Tuning [68.57166425493283]
Refusal-Aware Instruction Tuning (RAIT) enables Large Language Models (LLMs) to refuse to answer unknown questions.
However, naively tuning on refusals can cause LLMs to excessively refuse questions they could have answered correctly.
We introduce Certainty Represented Knowledge Flow for Refusal-Aware Instruction Tuning (CRaFT) to address this issue.
arXiv Detail & Related papers (2024-10-09T14:12:51Z)
- Evidence-Enhanced Triplet Generation Framework for Hallucination Alleviation in Generative Question Answering [41.990482015732574]
We propose EATQA, a novel evidence-enhanced triplet generation framework that predicts all combinations of the (Question, Evidence, Answer) triplet, i.e., each element given the other two.
We bridge the distribution gap to distill knowledge from the evidence at the inference stage.
Our framework ensures that the model learns the logical relations among query, evidence, and answer, which simultaneously improves evidence generation and query answering.
arXiv Detail & Related papers (2024-08-27T13:07:07Z)
- Where is the answer? Investigating Positional Bias in Language Model Knowledge Extraction [36.40833517478628]
Large language models require updates to remain up-to-date or to adapt to new domains.
One key challenge is memorizing the latest information in a way that makes it extractable with a query prompt.
Despite minimizing document perplexity during fine-tuning, LLMs struggle to extract the memorized information through a prompt sentence.
arXiv Detail & Related papers (2024-02-16T06:29:16Z)
- Open-Set Knowledge-Based Visual Question Answering with Inference Paths [79.55742631375063]
Knowledge-Based Visual Question Answering (KB-VQA) aims to provide a correct answer to a question with the aid of external knowledge bases.
We propose a new retriever-ranker paradigm for KB-VQA, Graph pATH rankER (GATHER for brevity).
Specifically, it comprises graph construction, pruning, and path-level ranking, which not only retrieves accurate answers but also provides inference paths that explain the reasoning process.
arXiv Detail & Related papers (2023-10-12T09:12:50Z)
- FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation [92.43001160060376]
We study the factuality of large language models (LLMs) in the context of answering questions that test current world knowledge.
We introduce FreshQA, a novel dynamic QA benchmark encompassing a diverse range of question and answer types.
We benchmark a diverse array of both closed and open-source LLMs under a two-mode evaluation procedure that allows us to measure both correctness and hallucination.
Motivated by these results, we present FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA (see the sketch after this list).
arXiv Detail & Related papers (2023-10-05T00:04:12Z)
- Language Models are Causal Knowledge Extractors for Zero-shot Video Question Answering [60.93164850492871]
Causal Video Question Answering (CVidQA) queries not only association or temporal relations but also causal relations in a video.
We propose a novel framework, Causal Knowledge Extraction from Language Models (CaKE-LM), leveraging causal commonsense knowledge from language models to tackle CVidQA.
CaKE-LM significantly outperforms conventional methods by 4% to 6% in zero-shot CVidQA accuracy on the NExT-QA and Causal-VidQA datasets.
arXiv Detail & Related papers (2023-04-07T17:45:49Z)
- Do Answers to Boolean Questions Need Explanations? Yes [11.226970608525596]
We release a new set of annotations marking the evidence in existing TyDi QA and BoolQ datasets.
We show that our annotations can be used to train a model that extracts improved evidence spans.
arXiv Detail & Related papers (2021-12-14T22:40:28Z)
- Hurdles to Progress in Long-form Question Answering [34.805039943215284]
We show that the task formulation raises fundamental challenges regarding evaluation and dataset creation.
We first design a new system that relies on sparse attention and contrastive retriever learning to achieve state-of-the-art performance.
arXiv Detail & Related papers (2021-03-10T20:32:30Z)
- REM-Net: Recursive Erasure Memory Network for Commonsense Evidence Refinement [130.8875535449478]
REM-Net is equipped with a module that refines the evidence by erasing low-quality evidence that does not explain the answer to the question.
Instead of retrieving evidence from existing knowledge bases, REM-Net leverages a pre-trained generative model to generate candidate evidence customized for the question.
The results demonstrate the effectiveness of REM-Net and show that the refined evidence is explainable.
arXiv Detail & Related papers (2020-12-24T10:07:32Z)
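
As a concrete illustration of the search-augmented few-shot prompting style that FreshPrompt describes (referenced in the FreshLLMs entry above), here is a minimal sketch. The prompt layout, the Snippet fields, and the demonstration are assumptions for illustration, not the actual FreshPrompt implementation.

```python
# A minimal sketch of search-augmented few-shot prompting in the spirit of
# FreshPrompt. The prompt layout, the Snippet fields, and the demonstration
# below are illustrative assumptions, not the actual FreshPrompt code.

from dataclasses import dataclass

@dataclass
class Snippet:
    source: str  # e.g. the URL of a search result
    date: str    # publication date (ISO format), relevant for fresh knowledge
    text: str    # the retrieved passage

# One hypothetical (question, evidence, answer) demonstration.
FEW_SHOT_DEMOS = [
    ("Who is the current CEO of Acme Corp?",
     "source: example.com | date: 2023-08-01 | Jane Doe became CEO of Acme in 2023.",
     "Jane Doe"),
]

def build_prompt(question: str, snippets: list[Snippet]) -> str:
    """Assemble demonstrations, then dated search results sorted so the most
    recent evidence sits closest to the question, then the question itself."""
    parts = [
        f"question: {q}\nevidence: {ev}\nanswer: {a}"
        for q, ev, a in FEW_SHOT_DEMOS
    ]
    for s in sorted(snippets, key=lambda s: s.date):  # oldest first
        parts.append(f"evidence: source: {s.source} | date: {s.date} | {s.text}")
    parts.append(f"question: {question}\nanswer:")
    return "\n\n".join(parts)
```

The resulting string would be passed to an LLM and its answer graded against the benchmark; sorting snippets so the freshest evidence sits nearest the final question is one plausible ordering choice for time-sensitive questions.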